Title: Improving statistical learning within functional genomic experiments by means of feature selection
Author: Mahmoud, Osama
ISNI: 0000 0004 6056 8175
Awarding Body: University of Essex
Current Institution: University of Essex
Date of Award: 2015
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS.
Statistical learning is concerned with understanding and modelling complex datasets. Given a set of training data, its main aim is to build a model that captures the relationship between a set of input features and a response of interest in a predictive way. Classification is the foremost task of such a learning process. It has applications across many important fields in modern biology, including the analysis of microarray data and other functional genomic experiments. Microarray technology allows the expression of tens of thousands of genes (features) to be measured simultaneously. However, the expression levels of these genes are usually observed in only a small number of tissue samples (observations), from tens to a few hundred. This high dimensionality has a great impact on the learning process, since most genes are noisy, redundant or irrelevant to the learning task at hand. Both the prediction accuracy and the interpretability of a constructed model are believed to be enhanced by performing the learning process based only on selected informative features. Motivated by this notion, a novel statistical method, named Proportional Overlapping Scores (POS), is proposed for selecting features based on an analysis of the overlap of gene expression data across the different classes of a classification task. This method yields a measure, called the POS score, of a feature's relevance to the learning task. POS is further extended to minimize the redundancy among the selected features. The proposed approaches are validated on several publicly available gene expression datasets using widely used classifiers to observe their effects on prediction accuracy. Selection stability is also examined to assess the biological knowledge captured in the obtained results.
The experimental results, in terms of classification error rates computed using the Random Forest, k-Nearest Neighbor and Support Vector Machine classifiers, show that the proposals achieve better performance than widely used gene selection methods.
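The idea of scoring a gene by how much its expression values overlap across classes can be illustrated with a small sketch. The code below is not the POS definition from the thesis, which this record does not give. It assumes a two-class problem, takes each class's interquartile range as a stand-in "core interval", and scores a feature by the fraction of samples falling where the two intervals intersect. The function names `core_interval`, `overlap_score` and `rank_features` are hypothetical and chosen for illustration only.

```python
from statistics import quantiles


def core_interval(values):
    """Return (Q1, Q3) for a list of expression values.

    Assumption: the interquartile range stands in for a class's
    'core' expression interval; the thesis may define it differently.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def overlap_score(values, labels):
    """Fraction of samples inside the intersection of the two classes'
    core intervals. Lower means less overlap between the classes.

    Illustrative simplification only, not the POS score itself.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles two-class problems only")
    first = [v for v, y in zip(values, labels) if y == classes[0]]
    second = [v for v, y in zip(values, labels) if y == classes[1]]
    (a_lo, a_hi), (b_lo, b_hi) = core_interval(first), core_interval(second)
    lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
    if lo > hi:
        # The core intervals do not intersect at all.
        return 0.0
    inside = sum(1 for v in values if lo <= v <= hi)
    return inside / len(values)


def rank_features(matrix, labels):
    """Return feature indices sorted by ascending overlap score.

    `matrix` holds one row of expression values per feature (gene),
    with columns aligned to `labels`.
    """
    scores = [overlap_score(row, labels) for row in matrix]
    return sorted(range(len(matrix)), key=lambda i: scores[i])
```

Under these assumptions, a gene whose two classes occupy disjoint expression ranges receives a score of zero and is ranked ahead of a gene whose classes are interleaved. The sketch does not address redundancy among the selected features.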
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: QA Mathematics