Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761
Title: Definition and analysis of population-based data completeness measurement
Author: Emran, Nurul Akmar Binti
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2011
Abstract:
Poor-quality data, such as data with errors or missing values, has negative consequences in many application domains. An important aspect of data quality is completeness. One problem in data completeness is that of missing individuals in data sets, where the individuals are the real-world entities whose information is recorded. So far, however, completeness studies have paid little attention to how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC), which deals with the missing individuals problem, with the aim of investigating what is required to measure PBC and identifying what is needed to support PBC measurements in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and a PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC contributes to an understanding of whether PBC measurements can feasibly provide accurate results. Further exploration of one issue discovered in the analysis showed that, when measuring PBC across multiple databases, data from those databases needs to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining the accuracy of the PBC measurements.
Our approach involves substituting some of the attributes from the contributing databases with smaller alternatives (proxies), by exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to the development of an algorithm to assess candidate proxy attributes in terms of space saving and accuracy of PBC measurement. The results of several case studies conducted for proxy assessment contribute to an understanding of the space-accuracy trade-offs offered by the proxies. Through the proposal and investigation of PBC, we have achieved a better understanding of how to deal with the completeness problem, in terms of the requirements to measure and support PBC in practice.
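As a rough illustration of the two ideas in the abstract, the sketch below computes a completeness ratio against a reference population and scores how well one attribute determines another. This record does not reproduce the thesis's PBC formula or its proxy assessment algorithm, so everything here is an assumption: the ratio form of `pbc`, the use of the standard g3 error as the AFD measure, and all function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def pbc(dataset_individuals, reference_population):
    """Assumed ratio form: share of the reference population found in the data set."""
    reference = set(reference_population)
    if not reference:
        raise ValueError("reference population must not be empty")
    present = set(dataset_individuals) & reference
    return len(present) / len(reference)


def g3_error(rows, lhs, rhs):
    """g3 error of the dependency lhs -> rhs: the smallest fraction of rows
    that must be removed for the dependency to hold exactly (0.0 = exact FD)."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # In each lhs group, keep only the rows carrying the most common rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


# Hypothetical sample data, not taken from the thesis.
coverage = pbc(["a", "b", "x"], ["a", "b", "c", "d"])

rows = [
    {"species": "Homo sapiens", "code": "HS"},
    {"species": "Homo sapiens", "code": "HS"},
    {"species": "Mus musculus", "code": "MM"},
    {"species": "Mus musculus", "code": "MX"},
]
error = g3_error(rows, "species", "code")
```

Under these assumptions, a candidate proxy with a low g3 error would be expected to lose little measurement accuracy when it replaces a larger attribute, while the space saved depends on the relative sizes of the two attributes.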
Supervisor: Embury, Suzanne ; Missier, Paolo
Sponsor: Ministry of Higher Education Malaysia ; Universiti Teknikal Malaysia Melaka
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.542761
DOI: Not available
Keywords: completeness measurement ; data quality