Title:

Scalable Bayesian regression utilising marginal information

This thesis explores approaches to regression that treat covariates as random variables. The distribution of the covariates, together with the conditional regression model Y | X, defines the joint model over (Y, X) and, in particular, the marginal distribution of the response Y. This marginal distribution provides a vehicle for incorporating prior information as well as external, marginal data. It also provides a means of parameterisation that can yield scalable inference, simple prior elicitation and, in the case of survival analysis, the complete treatment of truncated data. In many cases, this information can be utilised without the need to specify a model for X.
Chapter 2 considers Bayesian linear regression in settings where large marginal datasets are available, but the joint collection of response and covariate data is limited to a small dataset. These marginal datasets can be used to estimate the marginal means and variances of Y and X, which impose two constraints on the parameters of the linear regression model. We define a joint prior over the covariate effects and the conditional variance σ^{2} via a parameter transformation, which guarantees that these marginal constraints are met. This provides a computationally efficient means of incorporating marginal information, which is useful when incorporation via the imputation of missing values may be implausible. The resulting prior and posterior have rich dependence structures with a natural 'analysis of variance' interpretation, due to the constraint on the total marginal variance of Y. The concept of 'marginal coherence' is introduced, whereby competing models place the same prior on the marginal mean and variance of the response. Our marginally constrained prior can be extended by placing priors on the marginal variances, in order to perform variable selection in a marginally coherent fashion.
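The two marginal constraints described for Chapter 2 can be illustrated with a minimal sketch. It assumes a linear model Y = alpha + X beta + eps with eps independent of X, and it assumes the covariance matrix of X is known (a diagonal matrix if only marginal variances are available and the covariates are taken to be uncorrelated). The function name and arguments are hypothetical, and this is not necessarily the parameter transformation used in the thesis.

```python
import numpy as np

def solve_marginal_constraints(beta, mu_x, cov_x, mu_y, var_y):
    """Solve the two marginal constraints for the intercept and sigma^2.

    Assumed model (illustrative): Y = alpha + X @ beta + eps, eps independent of X, so
        E[Y]   = alpha + beta' mu_x
        Var[Y] = beta' cov_x beta + sigma^2
    """
    beta = np.asarray(beta, dtype=float)
    mu_x = np.asarray(mu_x, dtype=float)
    cov_x = np.asarray(cov_x, dtype=float)

    # Variance of Y attributed to the covariates.
    explained = float(beta @ cov_x @ beta)

    # The residual variance is whatever remains of the fixed total variance.
    sigma2 = var_y - explained
    if sigma2 <= 0:
        raise ValueError("covariate effects explain more than the marginal variance of Y")

    # The intercept is pinned down by the marginal mean of Y.
    alpha = mu_y - float(beta @ mu_x)

    # Share of the total variance explained by the covariates.
    return alpha, sigma2, explained / var_y
```

With the total variance of Y fixed, larger covariate effects necessarily leave less residual variance, which is the sense in which the variance budget is split between covariates and noise in this sketch.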
Chapter 3 constructs a Bayesian nonparametric regression model parameterised in terms of F_Y, the marginal distribution of the response. This allows marginal data to be incorporated directly and provides a natural means of specifying a prior distribution for a regression model. The construction is such that the distribution of the ordering of the responses, given covariates, takes the form of the Plackett-Luce model for ranks. This facilitates a composite likelihood approximation that decomposes the likelihood into a term for the marginal response data and a term for the probability of the observed ranking, which can be viewed as an extension of the partial likelihood for proportional hazards models. This convenient form leads to simple approximate posterior inference that circumvents the need to perform MCMC, allowing scalability to large datasets. We apply the model to a US Census dataset with over 1,300,000 data points and more than 100 covariates, where the nonparametric prior is able to capture the highly non-standard distribution of incomes.
Chapter 4 explores the analysis of randomised clinical trial (RCT) data for subgroup analysis, where interest lies in the optimal allocation of treatment D(X), based on covariates. Standard analyses build a conditional model Y | X, T for the response, given treatment and covariates, which can be used to deduce the optimal treatment rule. We show that treating covariates as random facilitates direct testing of a treatment rule, without the need to specify a conditional model. This provides a robust, efficient, and easy-to-use methodology for testing treatment rules. This nonparametric testing approach is used as a splitting criterion in a random-forest methodology for the exploratory analysis of subgroups. The model introduced in Chapter 3 is also applied in the context of subgroup analysis, providing a Bayesian nonparametric analogue to this approach in which inference is based only on the order of the data, circumventing the requirement to specify a full data-generating model. Both approaches to subgroup analysis are applied to data from an AIDS Clinical Trial.
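To make the rank-based term concrete, the following is a minimal sketch of the log-probability of an observed ordering under a standard Plackett-Luce model. How the scores depend on covariates (for example, a linear predictor) and whether the ordering runs from smallest to largest response are assumptions left to the caller; the function name is hypothetical, and this is only the ranking term, not the composite likelihood developed in the thesis.

```python
import numpy as np

def plackett_luce_log_prob(scores, order):
    """Log-probability of a full ranking under a Plackett-Luce model.

    scores : log-weights, one per item (assumed to come from the covariates)
    order  : item indices listed from first-ranked to last-ranked
    """
    # Arrange the scores in the observed ranking order.
    s = np.asarray(scores, dtype=float)[np.asarray(order)]

    # At stage i, the item is chosen from those not yet ranked, so the
    # normaliser is log-sum-exp over s[i:]; compute it as a reversed
    # running logaddexp for numerical stability.
    tail = np.logaddexp.accumulate(s[::-1])[::-1]

    return float(np.sum(s - tail))
```

Each stage has the same form as a risk-set term in the partial likelihood for proportional hazards models, which is the connection drawn in the abstract.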
