Title: Model selection in probability density estimation using Gaussian mixtures
Author: Sardo, Lucia
ISNI: 0000 0001 3552 9885
Awarding Body: University of Surrey
Current Institution: University of Surrey
Date of Award: 1997
This thesis proposes Gaussian mixtures as a flexible semiparametric tool for density estimation and addresses the problem of model selection for this class of density estimators. First, a brief introduction is given to the various model selection techniques proposed in the literature. The most commonly used techniques are cross-validation and methods based on data reuse, all of which are either computationally very intensive or extremely demanding in terms of training set size. Another class of methods, known as information criteria, allows model selection at a much lower computational cost and for any sample size. The main objective of this study is to develop a technique for model selection that is not too computationally demanding, yet is capable of delivering acceptable performance on a range of problems of various dimensionality. Another important issue addressed is the effect of the sample size. Large data sets are often difficult and costly to obtain, so keeping the sample size within reasonable limits is also very important. Nevertheless, sample size is central to the problem of density estimation, and one cannot expect good results with extremely limited samples. Information criteria are the most suitable candidates for a model selection procedure fulfilling these requirements. Schwarz's well-known Bayesian Information Criterion (BIC) is analysed, and its deficiencies when applied to high-dimensional data are noted. A modification that improves on BIC is proposed and named the Maximum Penalised Likelihood (MPL) criterion. This criterion has the advantage that it can be adapted to the data, and its satisfactory performance is demonstrated experimentally. Unfortunately, all information criteria, including the proposed MPL, suffer from a major drawback: a strong assumption of simplicity of the density to be estimated. This can lead to badly underfitted estimates, especially for small sample size problems.
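As an illustration of how an information criterion selects the number of mixture components, the sketch below scores one-dimensional Gaussian mixtures with the standard form of BIC, -2 log L + k ln n, where k is the number of free parameters and n the sample size. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the EM routine, the quantile initialisation, the synthetic data and all function names are hypothetical, and the MPL criterion itself is not reproduced because the abstract does not define it.

```python
import numpy as np

def _log_joint(x, weights, means, variances):
    """Log of weight * Gaussian density for every point and component, shape (n, m)."""
    return (np.log(weights)
            - 0.5 * np.log(2.0 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances)

def gmm_log_likelihood_1d(x, m, n_iter=200):
    """Fit an m-component 1-D Gaussian mixture by plain EM; return the final log-likelihood."""
    # Assumed initialisation: means at evenly spaced quantiles, shared variance.
    means = np.quantile(x, (np.arange(m) + 0.5) / m)
    variances = np.full(m, x.var())
    weights = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        log_p = _log_joint(x, weights, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = resp.T @ x / nk
        # Small floor on the variance guards against degenerate components.
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    log_p = _log_joint(x, weights, means, variances)
    return float(np.logaddexp.reduce(log_p, axis=1).sum())

def bic(log_lik, n_params, n):
    """Schwarz's criterion in its usual form; lower values are preferred."""
    return -2.0 * log_lik + n_params * np.log(n)

# Synthetic two-cluster sample (illustrative data, not from the thesis).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 250), rng.normal(3.0, 1.0, 250)])

scores = {}
for m in range(1, 5):
    # Free parameters in 1-D: (m - 1) weights + m means + m variances.
    n_params = 3 * m - 1
    scores[m] = bic(gmm_log_likelihood_1d(x, m), n_params, x.size)

best_m = min(scores, key=scores.get)
print(scores, best_m)
```

The penalty term grows with the parameter count, which in d dimensions with full covariance matrices rises roughly with the square of d per component. This is one plausible reading of why the behaviour of BIC on high-dimensional data receives separate attention.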
As a solution to such deficiencies, a procedure for validating the different models, based on an assessment of each model's predictive performance, is proposed. The optimality criterion for model selection can be formulated as follows: if a model is able to predict the observed data frequencies within the statistical error, it is an acceptable model; otherwise it is rejected. An attractive feature of such a measure of goodness is that it is an absolute measure rather than a relative one, which would only provide a ranking of the candidate models.
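One way to read the acceptance rule above is as a bin-by-bin comparison of observed counts against the counts the model predicts, with the binomial standard deviation standing in for the statistical error. The sketch below follows that reading for a one-dimensional mixture. The binning, the tolerance `z`, and the names `mixture_cdf` and `frequency_check` are assumptions made for illustration; the abstract does not specify how the thesis computes the frequencies or the error.

```python
import math
import numpy as np

def mixture_cdf(t, weights, means, stds):
    """CDF of a 1-D Gaussian mixture evaluated at a scalar t."""
    return sum(w * 0.5 * (1.0 + math.erf((t - mu) / (sd * math.sqrt(2.0))))
               for w, mu, sd in zip(weights, means, stds))

def frequency_check(x, edges, weights, means, stds, z=3.0):
    """Accept the model if every bin count lies within z binomial standard
    deviations of the count the model predicts (assumed tolerance rule)."""
    n = len(x)
    observed, _ = np.histogram(x, bins=edges)
    cdf = np.array([mixture_cdf(e, weights, means, stds) for e in edges])
    p = np.diff(cdf)                    # model probability of each bin
    expected = n * p                    # predicted bin counts
    sd = np.sqrt(n * p * (1.0 - p))     # binomial standard deviation per bin
    return bool(np.all(np.abs(observed - expected) <= z * sd))

# Synthetic two-cluster sample (illustrative data, not from the thesis).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000)])
edges = np.linspace(-6.0, 6.0, 13)

# Candidate A: the two-component mixture that generated the data.
ok_two = frequency_check(x, edges, [0.5, 0.5], [-3.0, 3.0], [1.0, 1.0])
# Candidate B: a single Gaussian matched to the sample mean and spread.
ok_one = frequency_check(x, edges, [1.0], [x.mean()], [x.std()])
print(ok_two, ok_one)
```

Because each candidate is judged against the data rather than against the other candidates, the check can reject every model on offer, which is what distinguishes an absolute measure from a ranking. In practice the comparison would presumably be made on data not used for fitting.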
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Applied mathematics