Title: Variable selection in the curse of dimensionality
Author: Mares, Mihaela Andreea
ISNI: 0000 0004 6348 4989
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2017
High-throughput technologies are nowadays leading to a massive availability of data to be explored. We are therefore keen to build mathematical and statistical methods for extracting as much value from the available data as possible. However, large dimensionality, in terms of both sample size and number of features or variables, poses new challenges. A large number of samples can be tackled relatively easily by increasing computational power and making use of distributed computation technologies. A large number of features or variables, in contrast, poses the risk of explaining variation in both noise and signal with the wrong explanatory variables. One approach to overcoming this problem is to select, from the initial set, a smaller set of features that are most relevant given an assumed prediction model. This approach is called variable or feature selection, and it implies using a bias or statistical assumption about which features should be considered more relevant. Different feature selection methods use different statistical assumptions about the mathematical relation between the predicted and explanatory variables, and about which explanatory variables should be considered more relevant. Our first contribution in this thesis is to combine the strengths of different variable selection methods that rely on different statistical assumptions. We start by classifying existing feature selection methods based on their assumptions and assessing their capacity to scale to high-dimensional data, particularly when the number of samples is much smaller than the number of features. We then propose a new algorithm that combines results from feature selection methods relying on disjoint assumptions about the function that generated the data, and we show that our method leads to better sensitivity than using each method individually.
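The idea of combining selectors with disjoint assumptions can be sketched as follows. This is a hypothetical illustration, not the thesis algorithm: the "linear" criterion here is absolute Pearson correlation and the "non-linear" criterion is correlation with a squared copy of each feature; the union of their top-k picks recovers features that either criterion alone would miss.

```python
import numpy as np

def select_union(X, y, k=5):
    """Combine two feature-selection criteria resting on disjoint
    assumptions and return the union of their top-k picks.
    Illustrative stand-ins: a linear score (|Pearson correlation|)
    and a crude non-linear score (|correlation of x^2 with y|)."""
    p = X.shape[1]
    lin = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    nonlin = np.abs([np.corrcoef(X[:, j] ** 2, y)[0, 1] for j in range(p)])
    top_lin = set(np.argsort(lin)[-k:])        # best under linearity
    top_nonlin = set(np.argsort(nonlin)[-k:])  # best under the quadratic proxy
    return sorted(top_lin | top_nonlin)        # union improves sensitivity

# Toy data: feature 3 acts linearly, feature 7 only through its square.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 3] + X[:, 7] ** 2 + 0.1 * rng.normal(size=200)
print(select_union(X, y, k=2))
```

The linear criterion alone is nearly blind to feature 7 (its Pearson correlation with y is close to zero), and the quadratic proxy under-ranks feature 3; only their union reliably contains both.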
The assumption of a linear relationship between the predicted variable and the explanatory variables is one of the most widely used simplifying assumptions. Our second contribution is to prove that at least one feature selection algorithm based on the linearity assumption is consistent even when the underlying function that generated the data is not necessarily linear. Based on these theoretical findings, we propose a new algorithm that provides better results when the underlying function that generated the data is at most partially linear. Neural networks, and in particular deep learning architectures, have been shown to fit highly non-linear prediction models when given sufficient training examples. However, they do not embed feature selection mechanisms. We contribute by assessing the performance of these models when given a large number of features and fewer samples, proposing a method for feature selection, and showing in which circumstances combining this feature selection method with deep learning architectures outperforms not using feature selection. Several feature selection methods, as well as the new methods we propose in this thesis, rely on re-sampling techniques or on running different algorithms on the same dataset. Their advantage is partially gained by using extra computational power. Therefore, our last contribution is an efficient data distribution and load-balanced parallel computation scheme for re-sampling based algorithms.
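A re-sampling based selection scheme of the kind described above can be sketched in a stability-selection style: run a base selector on many bootstrap resamples, distribute the resamples across workers, and keep the features selected in a large fraction of the runs. This is a minimal sketch under assumed details, not the thesis implementation; the base selector here is again absolute correlation, and threads stand in for the distributed computation the thesis targets.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def select_on_resample(args):
    """Score features on one bootstrap resample and return the top-k;
    the scoring rule is a stand-in for any base selector."""
    X, y, seed, k = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap indices
    Xb, yb = X[idx], y[idx]
    scores = np.abs([np.corrcoef(Xb[:, j], yb)[0, 1]
                     for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

def stability_select(X, y, n_resamples=20, k=3, threshold=0.6, workers=4):
    """Run the base selector on many resamples in parallel and keep
    features selected in at least `threshold` of the runs."""
    jobs = [(X, y, s, k) for s in range(n_resamples)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        picks = list(pool.map(select_on_resample, jobs))
    freq = np.zeros(X.shape[1])
    for selected in picks:
        for j in selected:
            freq[j] += 1.0 / n_resamples
    return sorted(np.flatnonzero(freq >= threshold))

# Toy data: only features 2 and 9 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 15))
y = X[:, 2] - 2.0 * X[:, 9] + 0.1 * rng.normal(size=150)
print(stability_select(X, y))
```

The resamples are independent of one another, which is what makes this family of methods amenable to the data distribution and load-balancing strategy described in the abstract: each worker only needs its own resample, and the final aggregation is a cheap frequency count.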
Supervisor: Guo, Yike
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral