Title: Kernel-based hypothesis tests : large-scale approximations and Bayesian perspectives
Author: Zhang, Qinyi
ISNI: 0000 0004 8507 3443
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2020
Availability of Full Text: Full text unavailable from EThOS.
This thesis contributes to the field of nonparametric hypothesis testing (i.e. two-sample and independence testing) by providing a large-scale framework and developing a Bayesian perspective. We focus on nonparametric measures of homogeneity and dependence by considering the Hilbert norms between the embeddings of probability distributions in the reproducing kernel Hilbert space (RKHS). The rich representation provided by the associated kernel feature map enables the use of multivariate or non-Euclidean observations (e.g. strings and graphs), and it leads to powerful tests that are able to solve challenging problems given enough observations. However, the cost of computing the kernel matrix scales at least quadratically in the number of samples, which makes these tests prohibitive to use on modern large-scale datasets. First, we propose three estimators of the well-known kernel dependence measure, the Hilbert-Schmidt Independence Criterion (HSIC), namely the block-based estimator, the Nyström estimator and the random Fourier feature (RFF) estimator, and establish the corresponding linear-time independence test for each of the estimators. Secondly, we consider a normalised version of HSIC, the NOrmalised Cross COvariance (NOCCO) statistic, and propose an RFF-approximated NOCCO. This results in a distribution-free test that is robust to misspecification of the kernel bandwidth. Thirdly, we propose a two-step conditional independence test that extends the popular two-step approach REgression with Subsequent Independence Test (RESIT) through RKHS-valued regressions. When used as a part of the classical PC algorithm for causal inference, the resulting algorithm is more robust to hidden variables that induce nonfunctional associations.
Finally, we utilise the classical Bayes factor formalism for model comparison and propose a Bayesian two-sample test by modelling the witness function of the well-known kernel measure of homogeneity, the Maximum Mean Discrepancy (MMD), with a Gaussian Process.
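To make the computational point in the abstract concrete, the following is a minimal illustrative sketch, not code from the thesis. It contrasts the standard biased HSIC estimate, which needs full n x n kernel matrices, with a random Fourier feature approximation whose cost grows linearly in n. The function names, the restriction to one-dimensional samples, the Gaussian kernel with bandwidth sigma, and the default of 50 features are all assumptions made for illustration; the thesis's own estimators and tests may differ in detail.

```python
import math
import random


def gaussian_gram(xs, sigma):
    """n x n Gaussian kernel matrix for 1-D samples (O(n^2) time and memory)."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def hsic_quadratic(xs, ys, sigma=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2, with H the centring matrix."""
    n = len(xs)
    K, L = gaussian_gram(xs, sigma), gaussian_gram(ys, sigma)
    row = [sum(r) / n for r in K]  # row means (equal to column means, K is symmetric)
    grand = sum(row) / n           # mean of all entries of K
    # (HKH)_ij = K_ij - row_i - row_j + grand, and trace(HKH L) = sum_ij (HKH)_ij L_ij
    return sum(
        (K[i][j] - row[i] - row[j] + grand) * L[i][j]
        for i in range(n)
        for j in range(n)
    ) / n ** 2


def rff_features(xs, sigma, num_features, rng):
    """Random Fourier features z(x) such that E[z(x) . z(y)] equals the Gaussian kernel."""
    ws = [rng.gauss(0.0, 1.0 / sigma) for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)
    return [[scale * math.cos(w * x + b) for w, b in zip(ws, bs)] for x in xs]


def centred_columns(Z):
    """Return the feature matrix as a list of columns, each with its mean removed."""
    n = len(Z)
    cols = []
    for col in zip(*Z):
        m = sum(col) / n
        cols.append([v - m for v in col])
    return cols


def hsic_rff(xs, ys, sigma=1.0, num_features=50, seed=0):
    """Squared Frobenius norm of the empirical feature cross-covariance.

    Cost is O(n * num_features^2): linear in the sample size n.
    """
    rng = random.Random(seed)
    n = len(xs)
    cols_x = centred_columns(rff_features(xs, sigma, num_features, rng))
    cols_y = centred_columns(rff_features(ys, sigma, num_features, rng))
    total = 0.0
    for a in cols_x:
        for b in cols_y:
            c = sum(u * v for u, v in zip(a, b)) / n  # one cross-covariance entry
            total += c * c
    return total
```

Both functions return a non-negative statistic that tends to be larger when the two samples are dependent. Turning either statistic into a test additionally requires a null distribution (for example via permutation), which this sketch does not cover.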
Supervisor: Sejdinovic, Dino ; Filippi, Sarah ; Teh, Yee Whye
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Machine learning--Statistical methods