Title: Crowdsourcing in pay-as-you-go data integration
Author: Osorno Gutierrez, Fernando
ISNI:       0000 0004 5920 0306
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2016
In pay-as-you-go data integration, feedback can inform the regeneration of different aspects of a data integration system and, as a result, help to improve the system's quality. However, feedback can be expensive: the amount of feedback required to annotate all possible integration artefacts is potentially large, while the budget for collecting it is often limited. Feedback can also be used in different ways. Feedback of different types, collected in different orders, can have different effects on the quality of the integration, and some feedback types can give rise to more benefit than others. There is therefore a need for techniques that collect feedback effectively. Previous efforts have explored the benefit of feedback for one aspect of the integration, but they have not considered the benefit of different feedback types in a single integration task.
We have investigated the annotation of mapping results using crowdsourcing, and implemented techniques to address reliability. The results indicate that precision estimates derived from crowdsourcing improve rapidly, suggesting that crowdsourcing can be used as a cost-effective source of feedback.
We propose an approach to maximize the improvement of data integration systems given a budget for feedback. Our approach takes into account the annotation of schema matchings, mapping results and pairs of candidate record duplicates. We define a feedback plan, which indicates the type of feedback to collect, the amount of feedback to collect and the order in which the different types of feedback are collected. We define a fitness function and a genetic algorithm to search for the most cost-effective feedback plans. We implemented a framework to test the application of feedback plans and measure the improvement of different data integration systems; within the framework, a greedy algorithm selects the mappings.
We designed quality measures to estimate the quality of a dataspace after the application of a feedback plan. To evaluate our approach, we propose a method for generating synthetic data scenarios and evaluate the approach in scenarios with different characteristics. The results show that, in several scenarios, the feedback plans generated by our approach achieved higher quality values than randomly generated feedback plans.
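Illustration (not part of the thesis record): the idea of searching for a feedback plan under a budget can be sketched in a few lines of Python. Everything below is an assumption made for illustration only: the unit costs and gains, the toy fitness function with diminishing returns and an order weight, and the names `plan_cost`, `fitness` and `search` are hypothetical and are not the fitness function, quality measures or genetic algorithm defined in the thesis. A plan is modelled as an ordered list of (feedback type, amount) steps over the three feedback types named in the abstract.

```python
import random

FEEDBACK_TYPES = ("matching", "mapping_result", "duplicate_pair")
# Hypothetical unit costs and gains, chosen for illustration only.
UNIT_COST = {"matching": 1.0, "mapping_result": 2.0, "duplicate_pair": 1.5}
UNIT_GAIN = {"matching": 0.8, "mapping_result": 1.0, "duplicate_pair": 0.6}


def plan_cost(plan):
    """Total cost of a plan: an ordered list of (feedback_type, amount) steps."""
    return sum(UNIT_COST[t] * n for t, n in plan)


def fitness(plan, budget):
    """Toy fitness (an assumption, not the thesis's function).

    Each extra item of one type is worth less than the previous one, and
    earlier steps weigh more than later ones. Over-budget plans get -inf.
    """
    if plan_cost(plan) > budget:
        return float("-inf")
    collected = {t: 0 for t in FEEDBACK_TYPES}
    score = 0.0
    for position, (t, n) in enumerate(plan):
        order_weight = 1.0 / (1 + position)
        for _ in range(n):
            collected[t] += 1
            score += order_weight * UNIT_GAIN[t] / collected[t]
    return score


def random_step(rng, max_amount):
    return (rng.choice(FEEDBACK_TYPES), rng.randint(0, max_amount))


def crossover(a, b, rng):
    """One-point crossover of two plans of equal length (at least 2 steps)."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(plan, rng, max_amount, rate=0.2):
    """Replace each step with a fresh random step with probability `rate`."""
    return [random_step(rng, max_amount) if rng.random() < rate else s for s in plan]


def search(budget, steps=4, max_amount=10, pop_size=30, generations=40, seed=0):
    """Small genetic algorithm with tournament selection and elitism."""
    rng = random.Random(seed)
    population = [[random_step(rng, max_amount) for _ in range(steps)]
                  for _ in range(pop_size - 1)]
    # Seed with a do-nothing plan so a within-budget plan always exists.
    population.append([(FEEDBACK_TYPES[0], 0)] * steps)

    def score(p):
        return fitness(p, budget)

    best = max(population, key=score)
    for _ in range(generations):
        def pick():
            # Tournament of three plans drawn from the current generation.
            return max(rng.sample(population, 3), key=score)

        children = [mutate(crossover(pick(), pick(), rng), rng, max_amount)
                    for _ in range(pop_size - 1)]
        children.append(best)  # elitism: never lose the best plan so far
        population = children
        best = max(population, key=score)
    return best
```

Calling `search(budget=20.0)` returns one plan whose cost does not exceed the budget; because the best plan is carried over between generations, its fitness never decreases during the search. Comparing such plans against randomly generated ones would need the quality measures of the thesis, which this sketch does not attempt to reproduce.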
Supervisor: Paton, Norman ; Fernandes, Alvaro
Sponsor: CONACYT, Mexico
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Data Integration ; Pay-As-You-Go ; Crowdsourcing ; User Feedback ; Mapping Annotation ; Mapping Selection ; Matching Annotation ; Entity Resolution ; Workflows ; Dataspaces ; Databases ; Amazon Mechanical Turk