Title: A tree-based measure for hierarchical data in mixed databases
Author: Hassan, Diman
ISNI: 0000 0004 6060 5992
Awarding Body: University of Nottingham
Current Institution: University of Nottingham
Date of Award: 2016
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Restricted access.
The structure of the data in a mixed database can be a barrier to clustering that database into meaningful groups. A hierarchically structured database requires efficient distance measures and clustering algorithms to locate similarities between data objects, and the existing literature accordingly proposes hierarchical distance measures for comparing the records in hierarchical databases. The main contribution of this research is a new distance measure for large hierarchical databases consisting of mixed data types and attributes, created and tested on the basis of an existing tree-based (hierarchical) distance metric, the pq-gram distance metric. Several aims and objectives were pursued to fill gaps in the current body of knowledge. One goal was to verify the validity of the pq-gram distance metric when applied to different data sets, and to compare and combine it with a number of other distance measures to demonstrate its usefulness across large mixed databases. To achieve this, further work explored how to exploit the existing method as a measure of hierarchical data attributes in mixed data sets, and whether the new method would produce better results with large mixed databases. For evaluation purposes, the pq-gram metric was applied to The Health Improvement Network (THIN) database to determine whether it could identify similarities between the records in the database. It was then applied to mixed data alongside other distance measures, both non-hierarchical and hierarchical, which were combined to create a Combined Distance Function (CDF). The CDF improved the results when applied to different data sets, such as the hierarchical National Bureau of Economic Research of the United States (NBER US) Patent data set and the mixed THIN data set.
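As an illustration of the tree-based metric named above, the following is a minimal Python sketch of the pq-gram distance as it is commonly defined in the literature: each tree is reduced to a bag of label tuples (a stem of p ancestors plus a base of q consecutive children, padded with null labels), and two trees are compared by the overlap of their bags. The tuple-based tree encoding, the function names, the defaults p = 2 and q = 3, and the example labels are assumptions made for this sketch; the record does not state the parameters or implementation used in the thesis.

```python
from collections import Counter

NULL = "*"  # placeholder label for missing ancestors and children


def pq_gram_profile(tree, p=2, q=3):
    """Return the bag (Counter) of pq-gram label tuples of a tree.

    A tree is encoded as (label, [child_tree, ...]); this encoding is an
    assumption of the sketch, not something taken from the thesis.
    """
    profile = Counter()

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)      # shift the node's label into the stem
        base = (NULL,) * q
        if not children:                # a leaf yields one all-null base
            profile[stem + base] += 1
            return
        for child in children:          # slide the base window over the children
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):          # trailing null padding
            base = base[1:] + (NULL,)
            profile[stem + base] += 1

    visit(tree, (NULL,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Normalised pq-gram distance in [0, 1]; 0 means identical profiles."""
    p1 = pq_gram_profile(t1, p, q)
    p2 = pq_gram_profile(t2, p, q)
    shared = sum((p1 & p2).values())            # bag intersection size
    total = sum(p1.values()) + sum(p2.values())  # bag union size
    return 1 - 2 * shared / total


# Two hypothetical three-node hierarchies that differ in one leaf label.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
d_same = pq_gram_distance(t1, t1)
d_diff = pq_gram_distance(t1, t2)
```

In this toy case each tree yields six pq-grams, two of which are shared, so `d_diff` is 1 - 4/12, while `d_same` is 0. Larger p gives more weight to ancestor context and larger q to sibling context.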
The CDF was then modified to create a New-CDF, which used only the hierarchical pq-gram metric to measure the hierarchical attributes in the mixed data set. The New-CDF worked well when applied to the THIN data set, finding the most similar data records and grouping them in one cluster using the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) clustering algorithm. The quality of the clusters was explored using two internal validation indices, Silhouette and C-Index, whose values showed good compactness and quality for the clusters obtained using the new method.
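The two internal validation indices mentioned above can be sketched in a few lines of Python using their standard textbook definitions. The sketch assumes that cluster labels have already been produced by some clustering step (BIRCH in the thesis) and that a pairwise distance function is available; the function names, the one-dimensional toy data and the absolute-difference distance are illustrative assumptions, not details from the thesis.

```python
from itertools import combinations


def silhouette(points, labels, dist):
    """Mean silhouette width; values near 1 indicate compact, separated clusters."""
    n = len(points)
    scores = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = dist(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:   # singleton cluster or only one cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


def c_index(points, labels, dist):
    """C-Index in [0, 1]; values near 0 indicate compact clusters."""
    pairs = [
        (dist(points[i], points[j]), labels[i] == labels[j])
        for i, j in combinations(range(len(points)), 2)
    ]
    within = [d for d, same in pairs if same]
    all_d = sorted(d for d, _ in pairs)
    k = len(within)
    s = sum(within)                      # observed within-cluster distance sum
    s_min = sum(all_d[:k])               # smallest possible sum over k pairs
    s_max = sum(all_d[len(all_d) - k:])  # largest possible sum over k pairs
    return (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0


# Hypothetical toy data: two well-separated groups on a line.
points = [0, 1, 10, 11]
labels = [0, 0, 1, 1]
abs_diff = lambda x, y: abs(x - y)
sil = silhouette(points, labels, abs_diff)
cix = c_index(points, labels, abs_diff)
```

For this toy data the silhouette is close to 0.9 and the C-Index is 0, the pattern expected of compact, well-separated clusters. Because both functions take an arbitrary `dist` callable, a tree-based or combined distance could be passed in place of `abs_diff`.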
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: QA 75 Electronic computers. Computer science