Title: A two-sample distribution-free test with applications to correlated genomic data
Author: Telford, Alison Jane
ISNI: 0000 0004 7970 2119
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2019
This thesis focuses on the identification of genomic regions that exhibit significant differences in Copy Number Alterations (CNA) between two clinical groups. CNA are a structural variation in the human genome in which some regions have more or fewer copies than the normal two. CNA patterns in some genomic regions across patients have been shown to be associated with disease phenotypes. Our interest is in testing which genomic regions exhibit different distributions between two clinical groups, to aid classification of patients by their cancer subtype and to discover new genomic markers for phenotypic identification. To do this, we apply a two-sample test to each genomic region to test the null hypothesis that the two distributions are equal. Standard statistical tests are inadequate for these data, in which the two groups may differ in any of the following aspects of the distribution: mean, variance, skewness, or multi-modality. When the null hypothesis is that two distributions are equal, the Anderson-Darling (AD) test is generally employed. The AD test was developed from the Cramer-von Mises (CvM) test statistic, which was originally proposed as a goodness-of-fit test. In the case of multi-modality, we find that the AD test often fails to identify true differences. We show, however, that the Cramer test, another modification of the CvM test, does not fail in the case of multi-modality. We have obtained the first four moments of the Cramer test statistic, which were not previously available. We also propose a new method for obtaining a p-value without resampling, by approximating the null distribution of the test statistic with a Generalised Pareto Distribution (GPD). With this approximation, the calculation of the p-value is much faster than with current methods, especially for large n.
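To make the two-sample setting concrete, the following is a minimal sketch of a Cramer-type statistic for two univariate samples. It assumes the pairwise-distance form in which the average distance between the samples is compared with the average distance within them; the scaling constant differs between references, and the abstract does not state which form the thesis uses. The p-value here comes from a plain permutation test, which is the kind of resampling the thesis aims to avoid, so this is not the GPD approximation described above. The function names and the example data are hypothetical.

```python
import random


def mean_abs_diff(a, b):
    """Average of |u - v| over all pairs (u, v) with u in a and v in b."""
    return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))


def cramer_statistic(x, y):
    """Cramer-type two-sample statistic (assumed scaling, see lead-in).

    Compares the mean between-sample distance with the mean
    within-sample distance; it is zero when the samples coincide.
    """
    m, n = len(x), len(y)
    between = mean_abs_diff(x, y)
    within = 0.5 * (mean_abs_diff(x, x) + mean_abs_diff(y, y))
    return (m * n / (m + n)) * (between - within)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Permutation p-value: reshuffle group labels and recompute the statistic."""
    rng = random.Random(seed)
    observed = cramer_statistic(x, y)
    pooled = list(x) + list(y)
    m = len(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if cramer_statistic(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical data: same centre, but one group is bimodal.
    rng = random.Random(42)
    unimodal = [rng.gauss(0.0, 1.0) for _ in range(30)]
    bimodal = [rng.gauss(rng.choice([-2.0, 2.0]), 0.5) for _ in range(30)]
    print(cramer_statistic(unimodal, bimodal))
    print(permutation_p_value(unimodal, bimodal))
```

The cost of the permutation loop grows with both the number of permutations and the square of the sample size, which illustrates why a closed-form approximation to the null distribution is attractive when many genomic regions are tested.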
A simulation study indicates that the Cramer test is as powerful as other tests in simple cases and more powerful in more complicated cases. To test our method, we applied the Cramer test to each genomic region to compare two groups among 76 lung cancer patients: 38 with adenocarcinoma and 38 with squamous carcinoma. Comparisons with the current method for identifying genomic regions of interest, KC Smart, also indicate that our method works well and is arguably preferable. When the genome is split into separate regions, we show that regions adjacent in genomic location can exhibit very high correlation of CNA. High correlation between genomic locations implies dependence between the simultaneously performed tests. Because of this dependence, multiplicity correction techniques for independent tests cannot be used alone, as the number of independent tests performed is unknown. Methods exist to estimate the effective number of independent tests; however, we find that these methods are slow and computationally expensive. We therefore extend existing work on Fisher's method to combine dependent p-values. We compare this approach with a multivariate version of the Cramer test and show that the two produce similar results on the lung cancer data set.
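For reference, the classical form of Fisher's method can be sketched as follows. This is the textbook version, which assumes independent p-values: the statistic X = -2 * sum(ln p_i) follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. The extension to dependent p-values developed in the thesis is not reproduced here, since the abstract does not describe it. The function name is hypothetical.

```python
import math


def fisher_combined_p_value(p_values):
    """Combine k independent p-values with Fisher's method.

    Uses the closed-form upper tail of a chi-squared distribution with
    an even number of degrees of freedom (2k):
        P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # Accumulate the finite series term by term to avoid large factorials.
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total
```

When the p-values are positively correlated, as they would be for adjacent genomic regions, this independent version tends to give combined p-values that are too small, which is the motivation for a dependence-aware extension.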
Supervisor: Gusnanto, Arief ; Taylor, Charles ; Wood, Henry
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available