Title: Sparse multivariate models for pattern detection in high-dimensional biological data
Author: Wang, Zi
ISNI: 0000 0004 7233 0623
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2015
Recent advances in technology have made it possible and affordable to collect biological data of unprecedented size and complexity. When analysing such data, traditional statistical methods and machine learning algorithms suffer from the curse of dimensionality. Parsimonious models, where parsimony may refer to model structure and/or model parameters, have been shown to improve both the biological interpretability of the model and its generalisability to new data. In this thesis we are concerned with model selection in both supervised and unsupervised learning tasks. For supervised learning, we propose a new penalty called graph-guided group lasso (GGGL) and employ this penalty in penalised linear regression. GGGL is able to integrate prior structured information with data mining: variables sharing similar biological functions are collected into groups, and the pairwise relatedness between groups is organised into a network. Such prior information guides the selection of variables that are predictive of a univariate response, so that the model selects variable groups that are close in the network and important variables within the selected groups. We then generalise the idea of incorporating network-structured prior knowledge to association studies consisting of multivariate predictors and multivariate responses, and propose the network-driven sparse reduced-rank regression (NsRRR). In NsRRR, pairwise relatedness between predictors and between responses is represented by two networks, and the model identifies associations between a subnetwork of predictors and a subnetwork of responses such that both subnetworks tend to be connected. For unsupervised learning, we are concerned with a multi-view learning task in which we compare the variance of high-dimensional biological features collected from multiple sources, which are referred to as “views”.
We propose the sparse multi-view matrix factorisation (sMVMF), which is parsimonious in both model structure and model parameters. sMVMF can identify latent factors that regulate the variability shared across all views and the variability that is characteristic of a specific view. For each novel method, we also present simulation studies and an application to real biological data to illustrate variable selection and model interpretability.
Supervisor: Montana, Giovanni
Sponsor: Biotechnology and Biological Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral