Title: Schema mapping generation for autonomous data sources
Author: Mazilu, Mihaela
ISNI: 0000 0005 0290 6348
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2020
The world is producing digital data at a rapid pace, and this has led to a new way of seeing data: as Big Data. Big Data refers to large numbers of large datasets that typically include complex, varied, rapidly changing data of uncertain quality that usually need preprocessing before analysis. Data integration over Big Data gives rise to the challenge of correlating unruly and heterogeneous repositories of data sources. In this thesis, our focus is on integration techniques for Big Data, and more specifically on generating mappings over large repositories of heterogeneous and autonomous datasets. A schema mapping generation algorithm constructs views for populating a target database schema from source schemas. We have designed, developed, and validated techniques for generating schema mappings, at scale, over autonomous data sources for which scant information is available, and for complex, multi-relation, constrained target schemas. Our proposed algorithm, Dynamap, uses dynamic programming to search the space of mappings. The mappings are built bottom-up, and the merge operators are chosen based on profiling information about the sources, i.e., candidate keys and (partial) inclusion dependencies. We have evaluated Dynamap in three main types of experiments: (i) with the state-of-the-art integration scenario generator, showing that it can handle the scenarios that mapping generation algorithms are expected to tackle; (ii) with variations of real-world scenarios from different domains with autonomous sources, showing that it can handle integration problems from real datasets; and (iii) with stress-test scenarios, showing that it can handle inputs comprising hundreds of data sources.
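The bottom-up idea in the abstract can be illustrated with a minimal Python sketch. It is not Dynamap itself: the function names (`choose_operator`, `generate_mappings`), the data layout, and the rule "full inclusion gives a join, partial inclusion gives a full outer join" are assumptions made for illustration only. The sketch keeps one mapping per set of source relations and builds mappings over n relations from smaller ones already found.

```python
def choose_operator(a, b, sources, keys, inclusion):
    """Pick a merge operator for base relations a and b from profiling data.

    Hypothetical rule: a shared attribute must be a candidate key on at
    least one side; the inclusion overlap then decides the operator.
    """
    for attr in sorted(sources[a] & sources[b]):
        if (a, attr) not in keys and (b, attr) not in keys:
            continue  # no candidate key on either side for this attribute
        overlap = max(inclusion.get((a, b, attr), 0.0),
                      inclusion.get((b, a, attr), 0.0))
        if overlap == 1.0:
            return ("JOIN", attr)             # full inclusion dependency
        if overlap > 0.0:
            return ("FULL OUTER JOIN", attr)  # partial inclusion dependency
    return None


def generate_mappings(sources, keys, inclusion, target):
    """Bottom-up search: mappings over n relations reuse smaller mappings.

    Returns {frozenset of source names: (expression, covered target attrs)}.
    """
    best = {frozenset([name]): (name, attrs & target)
            for name, attrs in sources.items()}
    for size in range(2, len(sources) + 1):
        new = {}
        for left, (l_expr, l_cov) in best.items():
            for right, (r_expr, r_cov) in best.items():
                if len(left) + len(right) != size or left & right:
                    continue  # wrong level, or the two mappings share a source
                combined = left | right
                if combined in new:
                    continue  # memoised: one mapping per set of sources
                ops = (choose_operator(a, b, sources, keys, inclusion)
                       for a in sorted(left) for b in sorted(right))
                op = next((o for o in ops if o is not None), None)
                if op is None:
                    continue  # profiling data gives no evidence for a merge
                expr = f"({l_expr} {op[0]} {r_expr} ON {op[1]})"
                new[combined] = (expr, l_cov | r_cov)
        best.update(new)
    return best
```

Calling `generate_mappings` with a dictionary of relation names to attribute sets, a set of `(relation, attribute)` candidate keys, a dictionary of inclusion overlaps, and the set of target attributes returns every mapping the profiling data supports. Relations with no key-backed overlap stay unmerged.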
Supervisor: Paton, Norman ; Fernandes, Alvaro
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: data integration ; autonomous data sources ; big data ; profiling data ; dynamic programming ; schema mapping generation ; relational data