Pattern recognition using labelled and unlabelled data
This thesis presents the results of a three-year investigation into combining labelled and
unlabelled data for data classification. In many fields, the quantity of data available to
practitioners has increased enormously over the last few years, partly because of improved
methods of automatic data capture and partly because of improved electronic communication,
particularly via the internet. These vast quantities of data require some form of processing
in order to transform them into information, which is often a costly business requiring
human (often expert) intervention.
Our rationale for this investigation is that we wish to augment the information provided by
human experts with data which has not been processed by them. Specifically, we investigate
classification using both processed (labelled) and unprocessed (unlabelled) data in order
to reduce the requirement for human intervention.
In Chapter 2 of the thesis we review several aspects of this problem as it features in the
current literature. We discuss
• Classification versus clustering
• Error estimation: training, testing and validation data
• Existing methods for combining labelled and unlabelled data
• Combining classifiers
• Artificial neural networks
• The number of labelled samples sufficient for a classification task
• Active selection
These topics are revisited in the subsequent chapters in which we present our new work.
We begin to introduce our novel work in Chapter 3, where we discuss five major approaches
to combining labelled and unlabelled data to augment the classifier. The first, and baseline,
classifier is trained only on the labelled data. Subsequent methods improve on this classifier:
• static labelling: using the labelled data to create the classifier and then using this
classifier to classify all the unlabelled data. The final training dataset is composed of
the union of the originally labelled and newly labelled datasets.
• dynamic labelling: incrementally retraining the classifier on a sample-by-sample basis
as the unlabelled data are classified.
• majority clustering: the majority vote from the labelled samples in a cluster (found
without using labels) determines the classification of new data.
• semi-supervised clustering: the labels are actively used in the clustering process.
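As a minimal illustration of the first of these schemes, static labelling can be sketched as follows. The nearest-centroid base classifier and the toy data are placeholders introduced purely for this sketch, not the classifiers or datasets used in the thesis:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one centroid per class.
    This is only a stand-in for whatever base classifier is used."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = list(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array([classes[i] for i in dists.argmin(axis=0)])

def static_labelling(X_lab, y_lab, X_unlab):
    """Train on the labelled data, label ALL unlabelled data once with that
    classifier, then retrain on the union of the two datasets."""
    model = nearest_centroid_fit(X_lab, y_lab)
    y_new = nearest_centroid_predict(model, X_unlab)
    return nearest_centroid_fit(np.vstack([X_lab, X_unlab]),
                                np.concatenate([y_lab, y_new]))

# Two well-separated classes; only four points carry labels.
X_lab = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.array([[0.1, 0.3], [4.8, 5.1], [0.3, 0.2], [5.1, 5.3]])

model = static_labelling(X_lab, y_lab, X_unlab)
print(nearest_centroid_predict(model, np.array([[0.0, 0.2], [5.0, 5.0]])))  # → [0 1]
```

Dynamic labelling differs only in that the unlabelled points would be labelled and folded into the training set one at a time, with retraining after each.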
We investigate a particular semi-supervised method which we call refined clustering: we
perform clustering and then refine the clusters based on the level of conflict among the
labelled data in each cluster. We discuss how the method reduces error through its effect
on both bias and variance, and we investigate methods of selecting which data points should
be labelled to form the initial labelled dataset.
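The majority-clustering idea above can be sketched as follows; the clustering step ignores the labels entirely, after which each cluster is labelled by a vote. The k-means routine, its deterministic initialisation and the toy data are all illustrative assumptions, not the thesis's own algorithm:

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=20):
    """Plain k-means; the class labels play no part in the clustering itself."""
    centres = X[:k].copy()  # deterministic initialisation, purely for illustration
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centres[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres, assign

def majority_cluster_labels(assign, labelled_idx, y_lab, k):
    """Each cluster takes the majority vote of the labelled samples it contains."""
    votes = {j: Counter() for j in range(k)}
    for i, y in zip(labelled_idx, y_lab):
        votes[assign[i]][y] += 1
    return {j: votes[j].most_common(1)[0][0] if votes[j] else None
            for j in range(k)}

# Six points in two natural clusters; only points 0 and 1 carry labels.
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.2],
              [5.1, 4.9], [0.2, 0.1], [4.9, 5.2]])
centres, assign = kmeans(X, k=2)
cluster_label = majority_cluster_labels(assign, labelled_idx=[0, 1], y_lab=[0, 1], k=2)
y_pred = [cluster_label[a] for a in assign]
print(y_pred)  # → [0, 1, 0, 1, 0, 1]
```

Refined clustering would then go further, inspecting clusters whose labelled members conflict and refining those clusters; that refinement is developed in Chapter 3 and is not reproduced here.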
In the next chapter, we discuss bagging, a method for combining classifiers, and apply it
to Kohonen's Self Organising Maps (SOMs). Bagging is typically performed with supervised
classifiers, but the SOM is an unsupervised topology-preserving mapping, which raises issues
that do not normally arise with bagging. We discuss several refinements to the algorithm
which enable us to use the method confidently with SOMs. Finally, we discuss supervised
and semi-supervised versions of the SOM in the context of bagging.
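Bagging itself, with a conventional supervised base classifier, can be sketched briefly: each base classifier is trained on a bootstrap resample and predictions are combined by majority vote. The 1-nearest-neighbour base classifier and toy data below are assumptions for illustration only:

```python
import numpy as np

def fit_1nn(X, y):
    """A 1-nearest-neighbour 'model' is just its training data (a stand-in base classifier)."""
    return X, y

def predict_1nn(model, X):
    Xt, yt = model
    return yt[np.linalg.norm(X[:, None] - Xt[None], axis=2).argmin(axis=1)]

def bagging_fit(X, y, n_models=11, seed=0):
    """Train each base classifier on a bootstrap resample of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=len(X), replace=True)
        models.append(fit_1nn(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the base classifiers by majority vote."""
    votes = np.stack([predict_1nn(m, X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, np.array([[0.1, 0.1], [5.0, 5.0]])))
```

Replacing the 1-NN base with a SOM is not a drop-in change, since an unsupervised SOM does not output class labels directly; this is among the issues the chapter addresses.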
In the next chapter, we consider the problem of estimating what fraction of a dataset
must be labelled before we can have confidence in a classifier trained on that labelled
dataset. We use sets of data points as a basis for each class in turn, which allows us to
minimise the reconstruction error optimally for members of that class but not for members
of other classes. We put these concepts into the framework of a negative feedback artificial
neural network and show how separating the projection and reconstruction stages enables us
to cluster datasets and, perhaps more importantly, to visualise their structure.
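The thesis realises these ideas within a negative feedback artificial neural network; as a rough functional analogue only, the sketch below uses an SVD basis per class and assigns a point to the class whose basis reconstructs it with the least error. The data, dimensions and the use of SVD here are all assumptions made for illustration:

```python
import numpy as np

def class_basis(X, n_components=1):
    """The class mean plus the principal subspace of one class's data (via SVD)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components]

def reconstruction_error(x, basis):
    mu, W = basis
    z = W @ (x - mu)       # projection stage
    x_hat = mu + W.T @ z   # reconstruction stage
    return np.linalg.norm(x - x_hat)

def classify(x, bases):
    """Assign x to the class whose basis reconstructs it with the least error."""
    return min(bases, key=lambda c: reconstruction_error(x, bases[c]))

# Two classes lying along (nearly) different lines in 3-D.
t = np.linspace(-1.0, 1.0, 20)
X0 = np.stack([t, 0.05 * t, np.zeros_like(t)], axis=1)   # roughly along the x-axis
X1 = np.stack([np.zeros_like(t), t, 0.05 * t], axis=1)   # roughly along the y-axis
bases = {0: class_basis(X0), 1: class_basis(X1)}
print(classify(np.array([0.9, 0.0, 0.0]), bases))  # → 0
```

A basis fitted to one class reconstructs that class well but reconstructs other classes poorly, which is what makes the per-class reconstruction error usable for classification.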
In the final chapter of new work, we discuss active and interactive selection of data
points for labelling. We thus explicitly accept the use of a human (though not a human
expert) in the classification process, but we try to optimise this input by automatically
presenting the data so that the task is straightforward for the human. We use the kernel
matrices which have been so important in the development of Support Vector Machines (SVMs)
and Kernel Principal Component Analysis (KPCA) in a way which has not previously been
envisaged, recovering for KPCA the sparseness which exists in SVMs but is missing from
standard KPCA.