Title: A novel variable selection method for classification with application to single nucleotide polymorphism data
Author: Hassan, N. B.
Awarding Body: University of Liverpool
Current Institution: University of Liverpool
Date of Award: 2017
Introduction and Aims: In recent years, there has been growing interest in studying genetic data to answer specific medical questions, for example to identify biomarkers that can accurately predict (classify) outcomes such as healthy versus diseased status or different categories of patient response to treatment. In genome-wide data analysis, a typical procedure is to use a variable selection approach, often univariable, whose primary aim is to select the most important genetic variants, particularly single nucleotide polymorphisms (SNPs), associated with an outcome of interest. This thesis proposes a novel variable selection method that takes the multivariate nature of genetic data into account. The aim of the thesis is threefold: (i) to develop a quantitative variable selection method for classification that can be used in the multivariate setting, is computationally inexpensive, and is easy to understand and apply; (ii) to propose a multi-step approach that selects SNPs and evaluates the classification performance of the resulting models in a cross-validation framework; and (iii) to jointly model longitudinal clinical and SNP data for classification using the Standard and New Antiepileptic Drugs (SANAD) dataset.
Methods: A literature search was conducted to study the different approaches to variable selection and their relationship with classification performance. A novel variable selection method, tSNR, was developed within a logistic regression framework to select the most informative SNPs. In addition, a multi-step framework involving univariable and multivariable selection in a cross-validation setting was proposed. The tSNR filter metric and the multi-step framework were then assessed using simulated datasets. The methods were further examined using an epilepsy pharmacogenomics dataset (EpiPGX), in which the phenotype of interest is seizure remission status after receiving the first well-tolerated antiepileptic drug (AED). A second epilepsy dataset, from the SANAD trial, was used as the validation dataset. Within the SANAD dataset, the longitudinal clinical and SNP data were jointly modelled using a longitudinal discriminant analysis (LoDA) approach with a multivariate generalised linear mixed model (MGLMM). Classification performance was measured by the probability of correct classification (PCC), the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV).
Results: The literature review suggested a need for variable selection methods, which could potentially improve classification accuracy. In the simulation study, the univariable tSNR ranking captured the causal SNPs among the top ten ranked SNPs. Within the proposed framework, results on the simulated datasets suggested that classification performance using SNPs selected by cumulative (multivariate) tSNR was better than that using SNPs selected by univariable tSNR ranking. These results were confirmed using the real clinical datasets. Adding SNP data to the longitudinal model based on clinical data improved the mean prediction time at which patients who will not achieve remission from seizures within five years of commencing treatment are identified. However, it did not improve classification performance.
Conclusions: The developed approach using the tSNR filter metric proved effective in ranking and selecting a subset of SNPs associated with the outcome of interest. The SNPs selected by tSNR were shown to give good classification accuracy. In addition, jointly modelling the longitudinal clinical data and the SNP data (selected using tSNR) in a longitudinal model improved the prediction time at which patients can be classified.
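The performance measures named in the Methods (PCC, AUC, sensitivity, specificity, PPV and NPV) follow standard definitions for a binary outcome. The sketch below is only an illustration of those definitions, not code from the thesis: the function names `classification_metrics` and `auc` are hypothetical, the outcome is assumed to be coded 1 for the class of interest and 0 otherwise, and the AUC is computed as the proportion of (positive, negative) pairs that the score ranks correctly, with ties counted as one half.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix measures for a binary outcome (assumed coding: 1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "pcc": ratio(tp + tn, tp + tn + fp + fn),  # probability of correct classification
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of observations, which is acceptable for a small illustration; a rank-based computation would be preferable for large samples.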
Supervisor: Garcia-Finana, Marta; Czanner, Gabriela; Jorgensen, Andrea
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral