Title: New statistical estimation methodology for active learning and classification
Author: Evans, Lewis Percival Gordon
ISNI: 0000 0004 5920 9343
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2016
Object classification by learning from data is a vast area of statistics and machine learning. Within classification, unlabelled data may be plentiful, allowing a few objects to be chosen for labelling by an expert. Choosing these few objects systematically can maximise classifier improvement: this is the problem of active learning (AL). Many heuristic methods coexist with theoretical approaches that make substantial assumptions, leaving a gulf between theory and practice, while a plethora of applications demands better algorithms and deeper understanding. Experimental studies give a very mixed picture of results, making AL performance rather mysterious. To explore this, a large-scale empirical study examines AL performance in detail.

One approach to active learning is to consider the optimal selection behaviour. Defining optimality by classifier improvement produces a new characterisation of optimal AL behaviour. This optimum yields theoretical insights and practical algorithms for applications, unifying theory and practice. The resulting approach is model retraining improvement (MRI), a novel statistical estimation framework for AL. MRI generates a new guarantee for AL: an unbiased MRI estimator should outperform random selection on average. New statistical AL algorithms are constructed to estimate the MRI optimum, revealing intricate estimation issues. One new algorithm in particular performs strongly in a large-scale experimental study, compared to standard AL methods. MRI is entirely general in terms of problems, classifiers and loss functions.

AL shows that classification examples are not created equal; this diversity of example quality implies that systematic selection and modification can both improve classifier performance. This idea is extended to classification itself, where label improbability, the improbability of a label given the covariates, gives a new definition of example quality.
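The MRI idea of selecting by expected classifier improvement can be sketched as follows: score each unlabelled candidate by the expected reduction in held-out loss when the classifier is retrained with that candidate added, averaging over its possible labels. This is a minimal illustrative sketch only; the nearest-centroid classifier, the uniform label probabilities, and all function names are assumptions for illustration, not the thesis's actual estimators.

```python
import numpy as np

def centroid_fit(X, y):
    # Toy classifier: one centroid per class (illustrative stand-in).
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(model, X):
    # Assign each row of X to its nearest class centroid.
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def expected_improvement(X_lab, y_lab, x_cand, X_val, y_val, classes, probs):
    # Expected reduction in validation 0-1 loss from labelling x_cand,
    # averaging over hypothetical labels weighted by probs.
    base = np.mean(centroid_predict(centroid_fit(X_lab, y_lab), X_val) != y_val)
    gain = 0.0
    for c, p in zip(classes, probs):
        X2 = np.vstack([X_lab, x_cand])
        y2 = np.append(y_lab, c)
        loss = np.mean(centroid_predict(centroid_fit(X2, y2), X_val) != y_val)
        gain += p * (base - loss)
    return gain

def mri_select(X_lab, y_lab, X_pool, X_val, y_val):
    # Pick the pool point with the largest expected retraining improvement.
    classes = list(np.unique(y_lab))
    # Deliberate simplification: uniform label probabilities.
    probs = [1.0 / len(classes)] * len(classes)
    scores = [expected_improvement(X_lab, y_lab, x, X_val, y_val, classes, probs)
              for x in X_pool]
    return int(np.argmax(scores))
```

In this toy form the label probabilities are uniform; estimating them well is precisely one of the intricate estimation issues the abstract refers to.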
Handling improbable labels (HIL) defines actions that modify the training data, by pruning, relabelling and weighting, to reduce the impact of improbable labels. Two large experimental studies establish the effectiveness of the HIL algorithms.
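The pruning action can be illustrated in miniature: score each training example by how improbable its label is given its covariates under a fitted model, then drop examples above a threshold. The nearest-centroid model, the softmax-over-distances probability estimate, the threshold, and all names below are illustrative assumptions, not the thesis's actual HIL algorithms.

```python
import numpy as np

def label_improbability(X, y, temp=1.0):
    # Estimate P(label | covariates) with a nearest-centroid model and a
    # softmax over negative distances (an illustrative choice), then
    # return 1 - P(observed label | x) for each training example.
    classes = np.unique(y)
    cents = np.stack([X[y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    p = np.exp(-d / temp)
    p /= p.sum(axis=1, keepdims=True)
    own = p[np.arange(len(y)), np.searchsorted(classes, y)]
    return 1.0 - own

def hil_prune(X, y, threshold=0.5):
    # Pruning action: drop examples whose observed label is too improbable.
    keep = label_improbability(X, y) < threshold
    return X[keep], y[keep]
```

Relabelling and weighting follow the same scoring step: instead of dropping a high-improbability example, one would reassign it to its most probable label or downweight it in the training loss.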
Supervisor: Adams, Niall ; Anagnostopoulos, Christoforos
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral