Use this URL to cite or link to this record in EThOS:
Title: Modern k-nearest neighbour methods in entropy estimation, independence testing and classification
Author: Berrett, Thomas Benjamin
ISNI:       0000 0004 6424 3730
Awarding Body: University of Cambridge
Current Institution: University of Cambridge
Date of Award: 2017
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Nearest neighbour methods are a classical approach in nonparametric statistics. The k-nearest neighbour classifier can be traced back to the seminal work of Fix and Hodges (1951) and they also enjoy popularity in many other problems including density estimation and regression. In this thesis we study their use in three different situations, providing new theoretical results on the performance of commonly-used nearest neighbour methods and proposing new procedures that are shown to outperform these existing methods in certain settings. The first problem we discuss is that of entropy estimation. Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this chapter, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the k-nearest neighbour distances of a sample. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient in up to three dimensions. A related topic of study is the estimation of the mutual information between two random vectors, and its application to testing for independence. We propose tests for the two different situations of the marginal distributions being known or unknown and analyse their performance. Finally, we study the classical k-nearest neighbour classifier of Fix and Hodges (1951) and provide a new asymptotic expansion for its excess risk. We also show that, in certain situations, a new modification of the classifier that allows k to vary with the location of the test point can provide improvements. This has applications to the field of semi-supervised learning, where, in addition to labelled training data, we also have access to a large sample of unlabelled data.
Supervisor: Samworth, Richard Sponsor: Sims Scholarship
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
Keywords: Nonparametric statistics ; Nearest neighbour methods ; Entropy Estimation ; Independence Testing ; Classification