Inferring information about correspondences between data sources for dataspaces
Traditional data integration offers high-quality services for managing and querying interrelated but heterogeneous data sources, but at a high cost: setting up a data integration system requires a significant amount of manual effort to specify precise relationships between the data sources. The recently proposed vision of dataspaces aims to reduce this upfront effort. One way to approach this aim is to infer schematic correspondences between the data sources, thus enabling the development of automated means for bootstrapping dataspaces. In this thesis, we discuss a two-step research programme to automatically infer schematic correspondences between data sources. In the first step, we investigate the effectiveness of existing schema matching approaches for inferring schematic correspondences, and contribute a benchmark, called MatchBench, for this purpose. In the second step, we contribute an evolutionary search method that identifies the set of entity-level relationships (ELRs) between data sources that qualify as entity-level schematic correspondences. Specifically, we model the requirements that such correspondences must satisfy using a vector space model. For each resulting ELR, we further identify a set of attribute-level relationships (ALRs) that qualify as attribute-level schematic correspondences. We demonstrate the effectiveness of the contributed inference technique using both MatchBench scenarios and real-world scenarios.