Title: Some investigations in discriminant analysis with mixed variables

The location model is a potential basis for discriminating between groups of objects
with mixed types of variables. The model specifies a parametric form for the conditional
distribution of the continuous variables given each pattern of values of the
categorical variables, thus leading to a theoretical discriminant function between
the groups. To conduct a practical discriminant analysis, the objects must first be
sorted into the cells of a multinomial table generated from the categorical values,
and the model parameters must then be estimated from the data. However, in many
practical situations some of the cells are empty, which prevents simple implementation
of maximum likelihood estimation and restricts the feasibility of linear model
estimators to cases with relatively few categorical variables.
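As a minimal sketch of the sorting step and of how empty cells arise, the following assumes that all categorical variables are binary, so that q variables generate 2^q cells. The function names and the toy data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def cell_index(binary_rows):
    """Map each row of 0/1 categorical values to a cell number 0 .. 2**q - 1."""
    binary_rows = np.asarray(binary_rows, dtype=int)
    q = binary_rows.shape[1]
    weights = 2 ** np.arange(q)  # first variable is the least significant bit
    return binary_rows @ weights

def cell_counts(binary_rows):
    """Count the objects falling in each of the 2**q cells."""
    q = np.asarray(binary_rows).shape[1]
    return np.bincount(cell_index(binary_rows), minlength=2 ** q)

# Hypothetical data: 5 objects, q = 3 binary variables, hence 8 cells.
x = [[0, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]]
counts = cell_counts(x)
empty_cells = np.flatnonzero(counts == 0)
```

With only five objects spread over eight cells, at least three cells must be empty, and in this toy data four are. Cell-wise maximum likelihood estimates of the means therefore do not exist for those cells.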
This deficiency was overcome by nonparametric smoothing estimation proposed
by Asparoukhov and Krzanowski (2000). Its usual implementation uses exponential
and piecewise smoothing functions for the continuous variables, and adaptive
weighted nearest neighbour for the categorical variables. Although this approach
widens the range of applicability, the smoothing parameters, chosen by maximising
the leave-one-out pseudo-likelihood, depend on distributional assumptions, while
the smoothing method for the categorical variables produces erratic values if the
number of variables is large. This thesis rectifies these shortcomings, and extends
location model methodology to situations where there are large numbers of mixed
categorical and continuous variables.
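To illustrate the idea of exponential smoothing across cells, the sketch below estimates the mean of the continuous variables in every cell of one group as a weighted average over all cells. It assumes binary categorical variables, a dissimilarity d(m, k) equal to the number of variables on which cells m and k differ, and weights of the form lam ** d(m, k) with 0 < lam < 1. These choices and all names are assumptions made for illustration and may differ from the forms used in the thesis.

```python
import numpy as np

def hamming_between_cells(q):
    """Dissimilarity d(m, k): number of binary variables on which cells m and k differ."""
    patterns = (np.arange(2 ** q)[:, None] >> np.arange(q)) & 1  # (2**q, q) 0/1 patterns
    return (patterns[:, None, :] != patterns[None, :, :]).sum(axis=2)

def smoothed_cell_means(y, cells, q, lam):
    """Smoothed mean vector for every cell of one group, with weights lam ** d(m, k)."""
    y = np.asarray(y, dtype=float)
    cells = np.asarray(cells)
    n_cells = 2 ** q
    counts = np.bincount(cells, minlength=n_cells).astype(float)
    sums = np.zeros((n_cells, y.shape[1]))
    np.add.at(sums, cells, y)            # per-cell sums of the continuous variables
    w = lam ** hamming_between_cells(q)  # w[m, k] = lam ** d(m, k)
    return (w @ sums) / (w @ counts)[:, None]

# Hypothetical group: 3 objects, one continuous variable, q = 2 (cells 1 and 2 are empty).
means = smoothed_cell_means([[1.0], [3.0], [10.0]], [0, 0, 3], q=2, lam=0.5)
```

Every cell, including the empty ones, receives a finite estimate because it borrows information from neighbouring cells. Values of lam close to 0 approach the raw cell means, while values close to 1 approach the overall group mean.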
Chapter 2 uses the simplest form of the exponential smoothing function for the continuous
variables and describes how the smoothing parameters can instead be chosen
by minimising either the leave-one-out error rate or the leave-one-out Brier score,
neither of which makes distributional assumptions. Alternative smoothing methods,
namely a kernel and a weighted form of the maximum likelihood, are also investigated
for the categorical variables. Numerical evidence in Chapter 3 shows that
there is little to choose among the strategies for estimating smoothing parameters
and among the smoothing methods for the categorical variables. However, some of
the proposed smoothing methods are more feasible when the number of parameters
to be estimated is reduced.
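The two leave-one-out criteria can be sketched as follows. The Brier score is taken here in one common form, the average over objects of the squared difference between the group indicator vector and the estimated posterior probabilities. The function loo_scores, the toy rule nearest_mean_posterior and the data are hypothetical stand-ins, not the smoothed location model itself.

```python
import numpy as np

def loo_scores(fit_posterior, x, g, n_groups):
    """Leave-one-out error rate and Brier score for a posterior-probability rule.

    fit_posterior(x_train, g_train, x_new) must return the estimated posterior
    probabilities of the n_groups groups for the single object x_new.
    """
    x = np.asarray(x, dtype=float)
    g = np.asarray(g)
    n = len(g)
    errors, brier = 0, 0.0
    for i in range(n):
        keep = np.arange(n) != i                 # omit object i from the training set
        post = np.asarray(fit_posterior(x[keep], g[keep], x[i]))
        target = np.eye(n_groups)[g[i]]          # indicator vector of the true group
        errors += int(np.argmax(post) != g[i])
        brier += np.sum((target - post) ** 2)
    return errors / n, brier / n

def nearest_mean_posterior(x_train, g_train, x_new):
    """Toy two-group rule: spherical normal groups, unit variance, equal priors."""
    group_means = np.array([x_train[g_train == k].mean(axis=0) for k in (0, 1)])
    log_dens = -0.5 * np.sum((group_means - x_new) ** 2, axis=1)
    dens = np.exp(log_dens - log_dens.max())
    return dens / dens.sum()

# Hypothetical, well-separated data in one dimension.
x = [[0.0], [0.2], [0.1], [3.0], [3.2], [2.9]]
g = [0, 0, 0, 1, 1, 1]
error_rate, brier_score = loo_scores(nearest_mean_posterior, x, g, n_groups=2)
```

In a parameter search, a routine of this kind would be evaluated once for each candidate value of the smoothing parameters, and the value giving the smallest error rate or Brier score would be retained. Neither criterion requires a density for the data, only the classifications or posterior probabilities produced by the rule.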
Chapter 4 reviews previous work on problems of high dimensional feature variables,
and focuses on selecting variables on the basis of the distance between groups. In
particular, the Kullback-Leibler divergence is considered for the location model,
but existing theory based on maximum likelihood estimators is not applicable for
general cases. Chapter 5 therefore describes the implementation of this distance for
smoothed estimators, and investigates its asymptotic distribution. The estimated
distance and its asymptotic distribution provide a stopping rule for a forward,
backward or stepwise selection search, following the test for no additional
information. Simulation results in Chapter 6 demonstrate the
feasibility of the proposed variable selection strategies for large numbers of variables,
but limitations in several circumstances are identified. Applications to real data sets
in Chapter 7 show that the proposed methods are competitive with, and sometimes
better than, other existing classification methods. Possible future work is outlined
in Chapter 8.
