Title: Novel hierarchical feature selection methods for classification and their application to datasets of ageing-related genes
Author: Wan, Cen
ISNI: 0000 0004 5922 9731
Awarding Body: University of Kent
Current Institution: University of Kent
Date of Award: 2015
Hierarchical Feature Selection (HFS) is an under-explored subarea of data mining and machine learning. Unlike conventional (flat) feature selection algorithms, HFS algorithms exploit hierarchical (generalisation-specialisation) relationships between features in order to improve the predictive accuracy of classifiers. The basic idea is to remove hierarchical redundancy between features, which arises because the presence of a feature in an instance implies the presence of all ancestors of that feature in that instance. An HFS algorithm selects a feature subset in which this hierarchical redundancy is eliminated or reduced, and only the selected subset is given to the classification algorithm, which can improve its predictive accuracy.
In terms of applications, this thesis focuses on datasets of ageing-related genes. Such datasets are an interesting application for data mining methods for two reasons: ageing experiments with humans are technically difficult and raise ethical issues, and research on the biology of ageing is strategically important, since age is the greatest risk factor for a number of diseases, yet ageing is still a poorly understood biological process. This thesis contributes mainly to data mining and machine learning, but also to bioinformatics and the biology of ageing, as discussed next.
The first and main type of contribution consists of four novel HFS algorithms: select Hierarchical Information Preserving (HIP) features, select Most Relevant (MR) features, the hybrid HIP–MR algorithm, and the Hierarchy-based Redundancy Eliminated Tree Augmented Naive Bayes (HRE–TAN) algorithm. These algorithms perform lazy learning-based feature selection, i.e. they postpone the learning process until a testing instance is observed and select a specific feature subset for each testing instance.
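As a rough illustration of lazy, per-instance removal of hierarchical redundancy, the sketch below drops the ancestors of every feature that is present in a testing instance and the descendants of every feature that is absent. This rule, the toy hierarchy `PARENTS` and the function names are assumptions made for the example; the abstract does not spell out the exact procedures of HIP, MR or HIP–MR, so this should not be read as a reproduction of any of them.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy (a DAG, as in GO): feature -> its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B"},
    "E": {"B", "C"},
}


def ancestors(feature: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `feature` by walking up the parent links."""
    seen: Set[str] = set()
    stack = list(parents[feature])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def descendants(feature: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every feature that has `feature` among its ancestors."""
    return {f for f in parents if feature in ancestors(f, parents)}


def select_features_for_instance(
    instance: Dict[str, int], parents: Dict[str, Set[str]]
) -> Set[str]:
    """Assumed redundancy rule, applied to one testing instance (lazy):
    a present feature (1) makes its ancestors redundant, and an absent
    feature (0) makes its descendants redundant."""
    removed: Set[str] = set()
    for feature, value in instance.items():
        if value == 1:
            removed |= ancestors(feature, parents)
        else:
            removed |= descendants(feature, parents)
    return set(instance) - removed


# One testing instance whose values are consistent with the hierarchy.
instance = {"A": 1, "B": 1, "C": 0, "D": 1, "E": 0}
selected = select_features_for_instance(instance, PARENTS)
print(sorted(selected))  # ['C', 'D']
```

Because the selection depends on the feature values of the instance being classified, a different testing instance generally yields a different subset, which is why the selected features suit lazy classifiers.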
HIP, MR and HIP–MR select features in a data pre-processing phase, before a classification algorithm is run, and the features they select can be used as input by any lazy classification algorithm. In contrast, HRE–TAN is a feature selection process embedded in the construction of a lazy TAN classifier.
The second type of contribution, relevant to data mining and bioinformatics, consists of two novel algorithms that exploit the pre-defined structure of the Gene Ontology (GO) and the results of a flat or hierarchical feature selection algorithm to create the network topology of a Bayesian Network Augmented Naive Bayes (BAN) classifier. These are called GO–BAN algorithms.
The proposed HFS algorithms were in general evaluated in combination with lazy versions of three Bayesian network classifiers, namely Naive Bayes (NB), TAN and GO–BAN, except that HRE–TAN works only with TAN. The experiments compared the predictive accuracy of these classifiers under three conditions: using the features selected by the proposed HFS algorithms, using the features selected by flat feature selection algorithms, and using all original features (no feature selection) as a baseline. The experiments used a number of ageing-related datasets, where the instances being classified are genes, the predictive features are GO terms describing hierarchical gene functions, and the classes to be predicted indicate whether a gene has a pro-longevity or anti-longevity effect on the lifespan of a model organism (yeast, worm, fly or mouse).
In general, with the exception of the hybrid HIP–MR, which did not obtain good results, the proposed HFS algorithms (HIP, MR and HRE–TAN) improved the predictive performance of the baseline Bayesian network classifiers, i.e. the classifiers generally obtained higher accuracies when using only the features selected by the HFS algorithm than when using all original features. Overall, the most successful of the four HFS algorithms was HIP, which outperformed all other (hierarchical or flat) feature selection algorithms when used in combination with each of the Naive Bayes, TAN and GO–BAN classifiers. The difference in predictive accuracy between HIP and the other feature selection algorithms was almost always statistically significant, except that the difference between HIP and MR was not significant with TAN. Among the combinations of an HFS algorithm and a Bayesian network classifier, HIP+NB and HIP+GO–BAN were jointly the best, with the same average rank across all datasets. Their predictive accuracies were statistically significantly higher than those obtained by all other combinations of HFS algorithm and classifier.
The third type of contribution is to the biology of ageing. More precisely, the proposed HIP and MR algorithms were used to produce rankings of GO terms in decreasing order of their usefulness for predicting the pro-longevity or anti-longevity effect of a gene on a model organism. The top GO terms in these rankings were interpreted with the help of a biologist who is an expert on ageing, leading to potentially relevant patterns about the biology of ageing.
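The comparison by average rank across datasets can be sketched as follows. The method names and accuracy values are made up for illustration and are not results from the thesis; the tie-handling rule (tied methods share the mean of their ranks) is a common convention assumed here, not one stated in the abstract.

```python
from typing import Dict, List


def ranks_with_ties(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank methods on one dataset: rank 1 is the highest accuracy,
    and methods with equal accuracy share the mean of their ranks."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the whole run of methods tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for name in ordered[i : j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def average_ranks(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each method's rank over all datasets.
    Assumes every method has a score on every dataset."""
    totals: Dict[str, float] = {}
    for scores in results:
        for name, rank in ranks_with_ties(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# Made-up accuracies for three hypothetical methods on three datasets.
toy_results = [
    {"method_a": 0.70, "method_b": 0.65, "method_c": 0.60},
    {"method_a": 0.62, "method_b": 0.68, "method_c": 0.55},
    {"method_a": 0.71, "method_b": 0.71, "method_c": 0.66},
]
avg = average_ranks(toy_results)
print(avg)  # {'method_a': 1.5, 'method_b': 1.5, 'method_c': 3.0}
```

In this toy data two methods end with the same average rank, which is the situation the abstract reports for HIP+NB and HIP+GO–BAN. Whether a difference in ranks is statistically significant requires a separate test, which this sketch does not perform.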
Supervisor: Freitas, Alex A.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Q Science