Title:

Learning in high dimensions with asymmetric costs

A classifier making predictions in noisy environments with limited data will inevitably make some mistakes. However, some types of error are more harmful than others. For example, the cost of misidentifying an explosive as a benign electronic device is far greater than the cost of misidentifying a laptop as an explosive. In such cases it is of paramount significance that the relative cost of different types of errors is taken into account when constructing and assessing algorithms for making predictions under uncertainty. This thesis will develop a distribution dependent theory of learning with asymmetric costs. For a comprehensive coverage we consider four different learning scenarios: (1) Supervised learning with a fixed cost matrix, (2) Weakly supervised costsensitive learning with unknown costs, (3) Sequential costsensitive learning with bandit feedback and (4) Neyman Pearson classification for qualitatively asymmetric costs. In each of these scenarios we seek to identify the key properties of the data distribution which influence the difficulty of the learning problem. These include the distribution's dimensionality, smoothness and the level of noise. We shall pay particular attention to minimax rates which give the minimum risk attainable by a classification algorithm, uniformly, over a given family of distributions. We begin our analysis with a supervised setting in which the costs of the possible prediction errors are given by a known cost matrix. In this setting we give a natural generalisation of Tysbakov's margin condition which quantifies the level of noise in the data in relation to the optimal decision boundaries for the cost matrix. We then identify the minimax rates with an exponent which depends solely upon the dimensionality of the distribution, the smoothness of the regression function and the level of noise. Significantly, these rates hold for all compact submanifolds and for any reasonable cost matrix. We then show that the minimax rates are actually attained by a simple costsensitive knearest neighbour method. Departing from the fully supervised setting we turn to a class of weakly supervised costsensitive learning problems with stochastic and feature dependent costs. We show that the minimax rates have the same exponent as their supervised counter parts and once again, these rates are attainable by simple knearest neighbour methods. We then move to an online scenario in which the learner sequentially makes a prediction and then incurs the cost of their prediction, without ever observing either the true class label or the costs that would have been incurred if an alternative prediction had been made. This problem is equivalent to the problem of multiarmed bandits with covariates. We present the conceptually simple kNN UCB algorithm for this problem, which combines the kNN method from supervised learning with the UCB algorithm for multiarmed bandits. We establish a minimax optimal regret bound for this algorithm (up to logarithmic terms) which demonstrates that the method is capable of adapting automatically to the unknown intrinsic dimensionality of the distribution. An alternative approach to learning with asymmetric costs is the NeymanPearson paradigm. In this paradigm the learner is given a set of samples from both a null and a positive class. The learner's goal is to construct a classifier which minimises the false negative rate whilst satisfying a userspecified upper bound on the false positive rate. This approach to classification is appropriate in applications such as medical diagnosis when the different types of errors are of a fundamentally different category. We present a nonparametric approach to NeymanPearson learning which sidesteps the necessity of density estimation and applies when the data is distributed on an unknown manifold within a highdimensional ambient feature space. We prove a high probability distributiondependent performance bound. Finally, we present the first minimax lower bounds in the NeymanPearson setting. Overall this thesis presents a clear picture of the way in which the properties of the distribution affect the optimal learning rates for highdimensional classification problems with asymmetric costs. Moreover, we demonstrate that these optimal rates are often attained by nonparametric techniques which are both conceptually simple and straightforward to implement.
