Title: Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain
Author: Sarac, Ferdi
ISNI: 0000 0004 7430 1068
Awarding Body: Northumbria University
Current Institution: Northumbria University
Date of Award: 2017
Availability of Full Text: Full text unavailable from EThOS.
In line with technological developments, there is almost no limit to the collection of high dimensional data in various fields, including bioinformatics. In most cases, these high dimensional datasets contain many irrelevant or noisy features, which need to be filtered out to find a small but biologically meaningful set of attributes. Although there have been various attempts to select predictive feature sets from high dimensional data in classification and clustering, there have been only limited attempts to do this for regression problems. Since supervised feature selection methods tend to identify noisy features in addition to discriminative variables, unsupervised feature selection methods (USFSMs) are generally regarded as less biased approaches. The aim of this thesis is, therefore, to provide: (i) a comprehensive overview of feature selection methods for regression problems, in which the methods are listed along with their types, references, sources, and code repositories; (ii) a taxonomy of feature selection methods for regression problems, to help researchers select appropriate methods for their research; (iii) a deep learning based unsupervised feature selection framework, DFSFR; and (iv) a K-means based unsupervised feature selection method, KBFS. To the best of our knowledge, DFSFR is the first deep learning based method designed specifically for regression tasks. In addition, a hybrid USFSM, DKBFS, is proposed, which combines KBFS and DFSFR to select discriminative features from very high dimensional data. The proposed frameworks are compared with state-of-the-art USFSMs, including Multi Cluster Feature Selection (MCFS), Embedded Unsupervised Feature Selection (EUFS), Infinite Feature Selection (InFS), Spectral Regression Feature Selection (SPFS), Laplacian Score Feature Selection (LapFS), and Term Variance Feature Selection (TV), as well as with the entire feature sets and the methods used in previous studies.
To evaluate the effectiveness of the proposed methods, four case studies are considered: (i) a low dimensional RV144 vaccine dataset; (ii) three different high dimensional peptide binding affinity datasets; (iii) a very high dimensional GSE44763 dataset; and (iv) a very high dimensional GSE40279 dataset. Experimental results on these datasets are used to validate the proposed methods. Compared to state-of-the-art feature selection methods, the proposed methods achieve improvements in prediction accuracy of as much as 9% for the RV144 vaccine dataset, 75% for the peptide binding affinity datasets, 3% for the GSE44763 dataset, and 55% for the GSE40279 dataset.
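Of the baseline methods named in the abstract, Term Variance Feature Selection (TV) is the simplest to illustrate: it ranks features by their variance across samples, without using the regression target, and keeps the top-ranked ones. The following is a minimal sketch of that general idea and is not code from the thesis. The function name `term_variance_select`, the parameter `k`, and the synthetic data are assumptions introduced only for illustration.

```python
import numpy as np


def term_variance_select(X, k):
    """Return the column indices of the k features with the largest variance.

    Hypothetical helper for illustration; the target variable is never used,
    which is what makes the selection unsupervised.
    """
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)  # one variance per feature (column)
    # Sort by descending variance; a stable sort keeps ties in column order.
    order = np.argsort(-variances, kind="stable")
    return order[:k]


# Synthetic data (an assumption): 50 samples, 4 features whose
# standard deviations are roughly 0.1, 5.0, 1.0 and 3.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([0.1, 5.0, 1.0, 3.0])

selected = term_variance_select(X, k=2)
X_reduced = X[:, selected]  # reduced matrix passed on to a regression model
```

In a regression setting, the reduced matrix would then be given to a regression model, and its prediction accuracy compared with that obtained from the entire feature set.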
Supervisor: Seker, Huseyin ; Bouridane, Ahmed
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: B900 Others in Subjects allied to Medicine ; G400 Computer Science