Title:
|
A novel clustering algorithm with a new similarity measure and ensemble methods for mixed data clustering
|
This thesis addressed four specific issues in clustering: (1) clustering algorithms, (2)
similarity measures, (3) number of clusters, K, and (4) clustering ensemble methods.
Following an in-depth review of clustering methods, a new three-staged (3-Staged)
clustering algorithm is proposed, with three new key aspects: (1) a new method for
automatically estimating the K value, (2) a new similarity measure and (3) initiating the
clustering process with a promising BASE. A BASE is a real sample that acts like a
centroid or a medoid in common clustering methods but it is determined differently in our
approach. A new similarity measure is defined specifically to reflect the degree of
relative change between data samples and, more importantly, to accommodate both
numerical and categorical variables. We have proven mathematically that the proposed
similarity measure satisfies the three properties of a metric. This research also
investigated the problem of determining the appropriate number of clusters in a dataset
and devised a novel function, which is integrated into our 3-Staged clustering algorithm,
to automatically estimate the most appropriate number of clusters, K. Based on our new
3-Staged clustering algorithm, we developed two new ensemble algorithms. For all
experiments, we used publicly available real-world benchmark datasets as these datasets
have been commonly used by other researchers. Experimental results showed that the
3-Staged clustering algorithm performed better than the compared individual methods,
including K-means and TwoStep, and also some ensemble-based methods such as K-ANMI
and ccdByEnsemble. They also showed that the proposed similarity measure is very
effective in improving the clustering quality. In addition, our proposed
method for estimating the K value identified the correct number of clusters for most of
the tested datasets.
|