Title: A novel clustering algorithm with a new similarity measure and ensemble methods for mixed data clustering
Author: Al Shaqsi, Jamil Darwish
ISNI: 0000 0004 2700 321X
Awarding Body: University of East Anglia
Current Institution: University of East Anglia
Date of Award: 2010
This thesis addresses four specific issues in clustering: (1) clustering algorithms, (2) similarity measures, (3) the number of clusters, K, and (4) clustering ensemble methods. Following an in-depth review of clustering methods, a new three-stage (3-Staged) clustering algorithm is proposed, with three new key aspects: (1) a new method for automatically estimating the value of K, (2) a new similarity measure and (3) initiating the clustering process with a promising BASE. A BASE is a real sample that acts like a centroid or a medoid in common clustering methods, but it is determined differently in our approach. The new similarity measure is defined specifically to reflect the degree of relative change between data samples and, more importantly, to accommodate both numerical and categorical variables. We have proven mathematically that the proposed similarity measure satisfies the three properties of a metric. This research also investigated the problem of determining the appropriate number of clusters in a dataset and devised a novel function, integrated into our 3-Staged clustering algorithm, that automatically estimates the most appropriate number of clusters, K. Based on the new 3-Staged clustering algorithm, we developed two new ensemble algorithms. For all experiments, we used publicly available real-world benchmark datasets, as these have been commonly used by other researchers. Experimental results showed that the 3-Staged clustering algorithm performed better than the individual methods it was compared with, including K-means and TwoStep, and also better than some ensemble-based methods such as K-ANMI and ccdByEnsemble. The results also showed that the proposed similarity measure is very effective in improving clustering quality, and that the proposed method for estimating K identified the correct number of clusters for most of the tested datasets.
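The abstract does not give the formula of the proposed similarity measure, so the following is only an illustrative sketch of the general idea of a mixed-data dissimilarity and of checking the three metric properties numerically. It uses a Gower-style distance (range-scaled absolute difference for numerical columns, simple mismatch for categorical columns), which is a swapped-in stand-in and not the measure defined in the thesis. The function names, the sample records and the column layout are all hypothetical.

```python
from itertools import product


def column_ranges(records, categorical_cols):
    """Return the value range of each numerical column, or None for categorical ones."""
    ranges = []
    for i in range(len(records[0])):
        if i in categorical_cols:
            ranges.append(None)
        else:
            column = [r[i] for r in records]
            ranges.append(max(column) - min(column))
    return ranges


def mixed_distance(a, b, ranges):
    """Gower-style distance (NOT the thesis's measure) between two mixed records."""
    total = 0.0
    for x, y, rng in zip(a, b, ranges):
        if rng is None:
            # Categorical column: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif rng > 0:
            # Numerical column: absolute difference scaled by the column range.
            total += abs(x - y) / rng
    return total / len(ranges)


def satisfies_metric_properties(points, dist, tol=1e-12):
    """Check identity/positivity, symmetry and the triangle inequality on a finite sample."""
    for a, b, c in product(points, repeat=3):
        if dist(a, a) != 0.0:
            return False
        if a != b and dist(a, b) <= 0.0:
            return False
        if abs(dist(a, b) - dist(b, a)) > tol:
            return False
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return False
    return True


# Hypothetical records: (age, income, colour, owns_car)
samples = [
    (25, 30000.0, "red", "yes"),
    (40, 52000.0, "blue", "no"),
    (33, 30000.0, "red", "no"),
    (58, 81000.0, "green", "yes"),
]
ranges = column_ranges(samples, categorical_cols={2, 3})


def sample_distance(a, b):
    return mixed_distance(a, b, ranges)


print(satisfies_metric_properties(samples, sample_distance))
```

A numerical check on a handful of records can only fail to find a counterexample; it is not a substitute for the mathematical proof that the abstract reports for the thesis's own measure.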
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available