Use this URL to cite or link to this record in EThOS:  http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.725565 
Title:  A concentration inequality based statistical methodology for inference on covariance matrices and operators  
Author:  Kashlak, Adam B. 
ORCID:  0000-0002-4050-7784
ISNI:  0000 0004 6424 3757


Awarding Body:  University of Cambridge  
Current Institution:  University of Cambridge  
Date of Award:  2017  
Abstract:  
In the modern era of high and infinite dimensional data, classical statistical methodology is often rendered inefficient and ineffective when confronted with such big data problems as arise in genomics, medical imaging, speech analysis, and many other areas of research. Many problems manifest when the practitioner is required to take into account the covariance structure of the data during his or her analysis, which takes on the form of either a high dimensional low rank matrix or a finite dimensional representation of an infinite dimensional operator acting on some underlying function space. Thus, novel methodology is required to estimate, analyze, and make inferences concerning such covariances. In this manuscript, we propose using tools from the concentration of measure literature (a theory that arose in the latter half of the 20th century from connections between geometry, probability, and functional analysis) to construct rigorous descriptive and inferential statistical methodology for covariance matrices and operators. A variety of concentration inequalities are considered, which allow for the construction of nonasymptotic, dimension-free confidence sets for the unknown matrices and operators. Given such confidence sets, a wide range of estimation and inferential procedures can be and are subsequently developed. For high dimensional data, we propose a method to search a concentration inequality based confidence set using a binary search algorithm for the estimation of large sparse covariance matrices. Both sub-Gaussian and sub-exponential concentration inequalities are considered and applied to both simulated data and to a set of gene expression data from a study of small round blue-cell tumours. For infinite dimensional data, which is also referred to as functional data, we use a celebrated result, Talagrand’s concentration inequality, in the Banach space setting to construct confidence sets for covariance operators.
From these confidence sets, three different inferential techniques emerge: the first is a k-sample test for equality of covariance operators; the second is a functional data classifier, which makes its decisions based on the covariance structure of the data; the third is a functional data clustering algorithm, which incorporates the concentration inequality based confidence sets into the framework of an expectation-maximization algorithm. These techniques are applied to simulated data and to speech samples from a set of spoken phoneme data. Lastly, we take a closer look at a key tool used in the construction of concentration based confidence sets: Rademacher symmetrization. The symmetrization inequality, which arises in the probability in Banach spaces literature, is shown to be connected with optimal transport theory and specifically the Wasserstein distance. This insight is used to improve the symmetrization inequality, resulting in tighter concentration bounds to be used in the construction of nonasymptotic confidence sets. A variety of other applications are considered, including tests for data symmetry and tightening inequalities in Banach spaces. An R package for inference on covariance operators is briefly discussed in an appendix chapter.


Supervisor:  Aston, John A. D. ; Nickl, Richard  Sponsor:  National Security Agency  
Qualification Name:  Thesis (Ph.D.)  Qualification Level:  Doctoral  
EThOS ID:  uk.bl.ethos.725565
Keywords:  Sparsity ; Thresholding estimator ; Procrustes ; Functional Data ; Talagrand's Inequality ; Log Sobolev Inequality ; Sub-Gaussian ; Sub-Exponential ; Classification ; Clustering ; Banach Space ; Rademacher Symmetrization ; Wasserstein Distance ; High Dimensional Data