Title: Statistical analysis of genomic binding sites using high-throughput ChIP-seq data
Author: Nafisah, Ibrahim Ali H.
ISNI: 0000 0004 5923 4936
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2015
This thesis focuses on the statistical analysis of chromatin immunoprecipitation sequencing (ChIP-Seq) data produced by next-generation sequencing (NGS). ChIP-Seq is a method for investigating interactions between proteins and DNA. Specifically, it aims to identify the genomic binding sites of a protein of interest, such as a transcription factor. In cancer research, this information helps to assess whether, for example, a particular transcription factor can be considered a therapeutic target. The sequence data produced by a ChIP-Seq experiment take the form of mapped short sequences called reads. The reads are counted at each genomic position, and these read counts are the data to be analysed. Many problems arise in the analysis of ChIP-Seq data; this research focuses on three of them.
First, ChIP-Seq data are not analysed across the genome in its entirety; instead, the intensity of read counts is estimated locally, which usually involves dividing the genome into small regions (windows). If the window size is too small, noise (low read counts) dominates and many windows are empty. If the window size is too large, each window pools many small read counts, which smooths out some important features. Researchers therefore need a way to choose an appropriate window size. To address this problem, we developed an approach that optimises the window size based on histogram construction. The methodology is published in [46].
Second, different ChIP-Seq studies target different transcription factors and, as expected, reach different conclusions. However, all of them produce ChIP-Seq data, and many are performed on the same genome, for example the human genome. Is there, then, a pattern in the distribution of the read counts? If so, is that pattern common to all ChIP-Seq data? Answering this question can lead to a better understanding of the biology behind the experiment. We investigate it using RUNX1/ETO ChIP-Seq data, attempting to develop a statistical model that describes the data and exploiting features observed in ChIP-Seq data to improve the model's performance. Although we obtained a model that can describe the RUNX1/ETO data, it does not provide a good statistical fit to the data.
Third, it is biologically important to know what changes, if any, occur at the binding sites under particular biological conditions, for example in knock-out experiments. Changes can occur in the location of the sites, in their characteristics (for example, the density of read counts), or in both. Current approaches to differential binding site analysis suffer from two major drawbacks: the underlying model is unclear because of dependencies between the methods used (for example, peak finding and testing), and type-I error is not accurately controlled. To address these drawbacks, we developed three statistical tests that detect significantly differential regions between two ChIP-Seq datasets. The tests are evaluated and compared with current methodologies using simulated and real ChIP-Seq datasets. The proposed tests exhibit greater power and accuracy than current methodologies.
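The window-size trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the method developed in the thesis, whose details are not given in this record. The function names are hypothetical, read positions are assumed to be 0-based integers, and the cost function is a swapped-in, generic histogram bin-width criterion (the Shimazaki-Shinomoto rule), used here only to show what "choosing a window size by histogram construction" can look like.

```python
def window_counts(read_positions, genome_length, window_size):
    """Count reads in consecutive fixed-size windows (0-based positions assumed)."""
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_positions:
        counts[pos // window_size] += 1
    return counts


def empty_fraction(counts):
    """Fraction of windows that contain no reads."""
    return sum(1 for c in counts if c == 0) / len(counts)


def histogram_cost(counts, window_size):
    """Generic bin-width cost (2 * mean - variance) / width**2.

    This is the Shimazaki-Shinomoto criterion, an assumption for
    illustration only; lower values indicate a better window size.
    """
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n  # biased variance
    return (2 * mean - variance) / window_size ** 2


if __name__ == "__main__":
    # Toy data: six reads on a 100 bp "genome" (made up for illustration).
    reads = [0, 1, 2, 50, 51, 99]
    for size in (10, 25, 50):
        counts = window_counts(reads, 100, size)
        print(size, counts, empty_fraction(counts), histogram_cost(counts, size))
```

On the toy data, small windows leave most windows empty, while large windows merge distinct clusters of reads into the same count, which is the tension the abstract describes.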
Supervisor: Gusnanto, A.; Taylor, C.; Westhead, D.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available