Title: Cost-effective data wrangling in data lakes
Author: Bogatu, Alex
ISNI:       0000 0004 8506 984X
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2020
Data analytics stands to benefit from the increased availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which specific target datasets can be constructed, through data preparation (also known as data wrangling), to enable value-adding analytics. Given the potential vastness and heterogeneity of data lakes, obtaining value from such targets often requires significant prior effort in preparing the data for analysis. For example, data wrangling is reported to take as much as 80% of the time of data scientists. The question then arises of how to decrease this cost. This thesis investigates what makes data preparation costly and how it can become more cost-effective through automation. Specifically, it addresses two challenges that have been insufficiently covered by the state of the art, viz., how to automatically retrieve from the data lake those datasets that might contribute to wrangling a given target, and how to automatically homogenise the representation of their instance values. We refer to the former as the problem of dataset discovery and to the latter as the problem of format transformation. This thesis contributes effective and efficient solutions to both problems. The work described in this thesis should be of interest to researchers and professionals in the areas of data analysis and data wrangling who, in the process of preparing data for analysis, are confronted with heterogeneously represented data originating from many autonomous sources.
Supervisor: Paton, Norman ; Fernandes, Alvaro
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: data discovery ; format transformation ; data wrangling ; data preparation