Title: Data integration for regulatory module discovery
Author: Mishra, Alok
ISNI: 0000 0004 2732 0768
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2012
Genomic data relating to the functioning of individual genes and their products are rapidly being produced using many diverse experimental techniques. Each piece of data provides information on a specific aspect of the cell regulation process. Integrating these diverse types of data is essential in order to identify biologically relevant regulatory modules. In this thesis, we address this challenge by analyzing the nature of these datasets and proposing new techniques for data integration. Since microarray data are not available in the quantities required for valid inference, many researchers have taken a blind integrative approach in which data from diverse microarray experiments are merged. To assess the validity of this approach, we begin by studying the heterogeneity of microarray datasets. We use the KL divergence between individual dataset distributions, as well as an empirical technique that we propose, to calculate functional similarity between the datasets. Our results indicate that datasets should not be integrated blindly: care should be taken to mix only similar types of data, and the choice of normalization method also matters. Next, we propose a semi-supervised spectral clustering method which integrates two diverse types of data for the task of gene regulatory module discovery. The technique uses constraints derived from DNA-binding, PPI and TF-gene interaction datasets to guide the spectral clustering of microarray experiments. Our results on yeast stress and cell-cycle microarray data indicate that the integration leads to more biologically significant results. Finally, we propose a technique that integrates datasets under the principle of maximum entropy. We argue that this is the most valid approach in an unsupervised setting, where we have no other evidence regarding the weights to be assigned to individual datasets. Our experiments with yeast microarray, PPI, DNA-binding and TF-gene interaction datasets show improved biological significance of results.
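As an illustration of the heterogeneity check mentioned in the abstract, the following is a minimal sketch of comparing two datasets by the KL divergence between their value distributions. It is not the thesis's implementation: the function names, the shared histogram binning, the pseudocount smoothing and the symmetrised form are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def binned_distribution(values: Sequence[float], lo: float, hi: float,
                        n_bins: int = 20, pseudocount: float = 1.0) -> List[float]:
    """Estimate a discrete distribution by histogramming values on [lo, hi].

    The pseudocount (an assumption of this sketch) keeps every bin non-zero,
    so the KL divergence below stays finite.
    """
    counts = [pseudocount] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), skipping bins where p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dataset_divergence(a: Sequence[float], b: Sequence[float],
                       n_bins: int = 20) -> float:
    """Symmetrised KL divergence between two datasets on shared bins.

    Symmetrising is a choice made here for illustration only.
    """
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:
        return 0.0  # All values identical: the distributions coincide.
    p = binned_distribution(a, lo, hi, n_bins)
    q = binned_distribution(b, lo, hi, n_bins)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Under these assumptions, a value near zero suggests that two datasets have similar expression-value distributions, while a large value flags a pair that may be unsuitable for merging without further normalization.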
Supervisor: Gillies, Duncan ; Rueckert, Daniel
Sponsor: Imperial College London ; Beit Trust
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral