Title:

New statistical perspectives on efficient Big Data algorithms for high-dimensional Bayesian regression and model selection

This thesis focuses on the development of computationally efficient procedures for regression modelling with datasets containing a large number of observations. Standard algorithms can be prohibitively computationally demanding on large n datasets, and we propose and analyse new computational methods for model fitting and selection. We explore three generic strategies for tall datasets. Divide and conquer approaches split the full dataset into subsets, which are then analysed independently in parallel, with the subset results pooled into an overall consensus. Subsampling based methods repeatedly use minibatches of data to estimate quantities of interest. The third strategy is 'sketching', a probabilistic data compression technique developed in the computer science community. Sketching uses random projection to compress the original large dataset, producing a smaller surrogate dataset that is less computationally demanding to work with. The sketched dataset can then be used for approximate inference. We test our regression algorithms on a number of large n genetic datasets, aiming to find associations between genetic variants and red blood cell traits.

Bayesian divide and conquer and subsampling methods have been studied in the fixed model setting, but little attention has been given to model selection. An important task in Bayesian model selection is computation of the integrated likelihood, and we propose divide and conquer and subsampling algorithms for estimating it. The divide and conquer approach is based on data augmentation, which is particularly useful for logistic regression. The subsampling approach involves constructing upper and lower bounds on the integrated likelihood: lower bounds can be formed using variational Bayes techniques, and we show how subsampling can be used to estimate an upper bound on the integrated likelihood.
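As a toy illustration of the divide and conquer strategy (a minimal sketch, not the data-augmentation or consensus algorithms developed in the thesis), least squares estimates computed independently on each subset can be pooled with precision weights. For Gaussian linear regression this recovers the full-data estimate exactly; the sample sizes and number of shards below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 10_000, 5, 4                      # observations, predictors, shards
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Fit least squares independently on each shard (parallelisable),
# then pool the shard estimates with precision weights X_s' X_s.
weighted = np.zeros(p)
prec = np.zeros((p, p))
for idx in np.array_split(np.arange(n), k):
    Xs, ys = X[idx], y[idx]
    P = Xs.T @ Xs                           # shard precision (unit noise variance)
    b = np.linalg.solve(P, Xs.T @ ys)       # shard least squares estimate
    weighted += P @ b
    prec += P
beta_pooled = np.linalg.solve(prec, weighted)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_pooled, beta_full))  # True
```

The exact agreement here is special to linear least squares; for non-Gaussian models such as logistic regression the pooled estimate is only approximate, which is what motivates the more careful combination schemes studied in the thesis.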
Sketching algorithms generate a compressed set of responses and predictors that can then be used to estimate regression coefficients. They use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We examine the statistical properties of sketching algorithms, which allows us to quantify the error in the coefficients estimated from the sketched dataset. The proportion of variance explained by the model proves to be an important quantity in choosing between alternative sketching algorithms. This is particularly relevant to genetic studies, where the signal to noise ratio can be low.

We also investigate sketching as a tool for posterior approximation. The sketched dataset can be used to generate an approximate posterior distribution over models. As expected, the quality of the posterior approximation improves with the number of observations in the sketched dataset; the tradeoff is that the computational cost of sketching increases with the size of the desired sketch. The main conclusion is that impractically large sketch sizes are needed to obtain a tolerable approximation of the posterior distribution over models. We test the performance of sketching for posterior approximation on a large genetic dataset, and a key finding is that false positives are a major issue when performing model selection. Practical regression analysis with large n datasets can require specialised algorithms. Parallel processing, subsampling and random projection are all useful tools for computationally efficient regression modelling.
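A minimal numerical sketch of the idea, using a dense Gaussian random projection (in practice fast transforms such as the subsampled randomized Hadamard transform are preferred; the sizes below are illustrative, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 5_000, 10, 500                    # original rows, predictors, sketch size
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

# Gaussian sketch: an m x n random projection compresses (X, y) to m rows.
S = rng.standard_normal((m, n)) / np.sqrt(m)
X_sk, y_sk = S @ X, S @ y

# Least squares on the sketched data approximates the full-data fit.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sk = np.linalg.lstsq(X_sk, y_sk, rcond=None)[0]
print(np.max(np.abs(beta_sk - beta_full)))  # small when m is much larger than p
```

The approximation error shrinks as the sketch size m grows relative to p, but the cost of forming and applying the projection grows with m, which is the tradeoff discussed above.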
