Title:
|
Peak selection in metabolic profiles using functional data analysis
|
In this thesis we describe sparse principal component analysis (PCA) methods and apply them to the analysis of short multivariate time series in order to perform both dimensionality reduction and variable selection. We take a functional data analysis (FDA) modelling approach in which each time series is treated as a continuous smooth function of time, or curve. These techniques have been applied to time series data arising in the area of metabonomics, which studies chemical processes involving small-molecule metabolites in a cell. We use experimental data obtained from the COnsortium for MEtabonomic Toxicology (COMET) project, formed by six pharmaceutical companies and Imperial College London, UK. In the COMET project, repeated measurements of several metabolites were collected over time from rats subjected to different drug treatments. The aim of our study is to detect important metabolites by analysing the multivariate time series. Multivariate functional PCA is an exploratory technique for describing the observed time series. In its standard form, PCA involves linear combinations of all variables (i.e. metabolite peaks) and does not perform variable selection. In order to select a subset of important metabolites we introduce sparsity into the model. We develop a novel functional Sparse Grouped Principal Component Analysis (SGPCA) algorithm using ideas related to the Least Absolute Shrinkage and Selection Operator (LASSO), a regularized regression technique, with grouped variables. This SGPCA algorithm detects a sparse linear combination of metabolites that explains a large proportion of the variance. Apart from SGPCA, we also propose two alternative approaches for metabolite selection. The first is based on thresholding the multivariate functional PCA solution, while the second computes the variance of each metabolite curve independently and then ranks the curves in decreasing order of importance.
To the best of our knowledge, this is the first application of sparse functional PCA methods to the problem of modelling multivariate metabonomic time series data and selecting a subset of metabolite peaks. We present comprehensive experimental results using simulated data and COMET project data for different multivariate and functional PCA variants from the literature and for SGPCA. Simulation results show that the SGPCA algorithm recovers a high proportion of truly important metabolite variables. Furthermore, when SGPCA is applied to the COMET dataset, we identify a small number of important metabolites independently for two different treatment conditions. A comparison of the metabolites selected in the two treatment conditions reveals an overlap of over 75 percent.
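As an illustration of the second alternative approach mentioned above (ranking metabolite curves by their variance), the following is a minimal sketch and not the implementation used in the thesis. It assumes the curves have already been evaluated on a common time grid and stored as an array of shape (subjects, metabolites, time points), and it assumes that the "variance of a curve" is summarised as the between-subject variance averaged over time. The function name `rank_metabolites_by_variance` and the simulated data are hypothetical.

```python
import numpy as np


def rank_metabolites_by_variance(curves):
    """Rank metabolites by a time-averaged, between-subject variance score.

    curves: array of shape (n_subjects, n_metabolites, n_timepoints).
    Returns (order, score), where order lists metabolite indices from the
    highest score to the lowest. This scoring rule is an assumption.
    """
    curves = np.asarray(curves, dtype=float)
    # Sample variance across subjects at each time point, per metabolite.
    pointwise_var = curves.var(axis=0, ddof=1)  # (n_metabolites, n_timepoints)
    # Average over the time grid as a simple stand-in for integrating
    # the variance function over time.
    score = pointwise_var.mean(axis=1)
    # argsort is ascending, so reverse it for decreasing importance.
    order = np.argsort(score)[::-1]
    return order, score


# Simulated example (hypothetical data): only metabolite 2 varies
# strongly between subjects; the others are small noise.
rng = np.random.default_rng(0)
n_subjects, n_metabolites, n_timepoints = 10, 5, 8
t = np.linspace(0.0, 1.0, n_timepoints)
curves = 0.05 * rng.standard_normal((n_subjects, n_metabolites, n_timepoints))
amplitude = rng.standard_normal(n_subjects)
curves[:, 2, :] += 3.0 * amplitude[:, None] * np.sin(np.pi * t)[None, :]

order, score = rank_metabolites_by_variance(curves)
print("metabolites ranked by variance score:", order)
```

Because each metabolite is scored independently, this ranking ignores correlations between metabolites, which is the main way it differs from the PCA-based approaches described above.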
|