Title: The noise component in model-based clustering
Author: Coretto, Pietro
ISNI:       0000 0004 2672 0778
Awarding Body: University of London
Current Institution: University College London (University of London)
Date of Award: 2008
Availability of Full Text: Full text unavailable from EThOS.
Model-based cluster analysis is a statistical tool used to investigate group structures in data. Finite mixtures of Gaussian distributions are a popular device for modelling elliptically shaped clusters. Estimation of mixtures of Gaussians is usually based on the maximum likelihood method. However, for a wide class of finite mixtures, including Gaussians, maximum likelihood estimates are not robust: a small proportion of outliers in the data can lead to poor estimates and poor clustering. One way to deal with this is to add a "noise component", i.e. a mixture component that models the outliers. In this thesis we explore this approach through three contributions. First, Fraley and Raftery (1993) proposed a Gaussian mixture model with the addition of a uniform noise component supported on the data range. We generalize this approach by introducing a model that is a finite mixture of location-scale distributions mixed with a finite number of uniforms supported on disjoint subsets of the data range. We study identifiability and maximum likelihood estimation, and provide a computational procedure based on the EM algorithm. Second, Hennig (2004) proposed a model in which the noise component is represented by a fixed improper density that is constant on the real line. He showed that the resulting estimates are robust to extreme outliers. We define a maximum likelihood type estimator for such a model and study its asymptotic behaviour. We also provide a method for choosing the improper constant density, and a computational procedure based on the EM algorithm. The third contribution is an extensive simulation study in which we measure the performance of the previous two methods and of certain other robust methodologies proposed in the literature.
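The basic idea of a uniform noise component can be illustrated with a minimal sketch. The code below is not the procedure developed in the thesis: it is a plain one-dimensional EM iteration for a Gaussian mixture plus a single uniform component whose density is held fixed at 1 / (max(x) - min(x)), i.e. a uniform on the data range. The function name `fit_gmm_with_uniform_noise`, the quantile-based starting values, and the variance floor are all assumptions made for this illustration.

```python
import numpy as np


def fit_gmm_with_uniform_noise(x, n_clusters=2, n_iter=200):
    """EM sketch: 1-D Gaussian mixture plus one uniform noise component.

    Assumption for illustration: the noise density is fixed at
    1 / (max(x) - min(x)) and only its mixing weight is estimated.
    Returns (weights, means, variances, resp); index 0 of weights and
    column 0 of resp refer to the noise component.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    noise_density = 1.0 / (x.max() - x.min())

    # Hypothetical starting values: evenly spaced quantiles as means,
    # the pooled variance for every cluster, and equal mixing weights.
    probs = (np.arange(n_clusters) + 0.5) / n_clusters
    means = np.quantile(x, probs)
    variances = np.full(n_clusters, x.var())
    weights = np.full(n_clusters + 1, 1.0 / (n_clusters + 1))

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = np.empty((n, n_clusters + 1))
        dens[:, 0] = weights[0] * noise_density
        diff = x[:, None] - means[None, :]
        gauss = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
        dens[:, 1:] = weights[1:] * gauss
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update weights, then Gaussian means and variances.
        totals = resp.sum(axis=0)
        weights = totals / n
        means = (resp[:, 1:] * x[:, None]).sum(axis=0) / totals[1:]
        diff = x[:, None] - means[None, :]
        variances = (resp[:, 1:] * diff**2).sum(axis=0) / totals[1:]
        # Floor the variances to avoid degenerate zero-variance solutions.
        variances = np.maximum(variances, 1e-6)

    # resp holds the responsibilities from the last E-step.
    return weights, means, variances, resp
```

In this sketch, points whose responsibility in column 0 of `resp` is the largest would be labelled as noise, so a few extreme observations inflate the noise weight rather than the Gaussian variances.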
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available