Use this URL to cite or link to this record in EThOS:
Title: Graphical model selection for Gaussian conditional random fields in the presence of latent variables : theory and application to genetics
Author: Frot, Benjamin
ISNI:       0000 0004 6421 4251
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2016
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
The task of performing graphical model selection arises in many applications in science and engineering. The field of application of interest in this thesis relates to the needs of datasets that include genetic and multivariate phenotypic data. There are several factors that make this problem particularly challenging: some of the relevant variables might not be observed, high-dimensionality might cause identifiability issues and, finally, it might be preferable to learn the model over a subset of the collection while conditioning on the rest of the variables, e.g. genetic variants. We suggest addressing these problems by learning a conditional Gaussian graphical model, while accounting for latent variables. Building on recent advances in this field, we decompose the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this novel estimator, show that it is well-behaved in the high-dimensional regime and describe algorithms that can be used when the number of variables is in the thousands. Through simulations, we illustrate the conditions required for identifiability and show that this approach is consistent in a wider range of settings. In order to show the practical implications of our work, we apply our method to two real datasets and devise a metric that makes use of an independent source of information to assess the biological relevance of the estimates. In our first application, we use the proposed approach to model the levels of 39 metabolic traits conditional on hundreds of genetic variants, in two independent cohorts. We find our results to be better replicated across cohorts than the ones obtained with other methods. In our second application, we look at a high-dimensional gene expression dataset. We find that our method is capable of retrieving as many biologically relevant gene-gene interactions as other methods while retrieving fewer irrelevant interaction.
Supervisor: Jostins, Luke ; McVean, Gilean Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available