Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332
Title: Model selection, union and assembling in practical data analysis : methods and case study
Author: Muhammad, Awaz K.
ISNI: 0000 0004 7428 3277
Awarding Body: University of Leicester
Current Institution: University of Leicester
Date of Award: 2018
Abstract:
The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for, and validation of, psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed the existence of groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks. Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for the selection of the main factors (principal components) and main attributes. Consideration of each attribute as a distribution on factors allowed us to apply any Kaiser rule for feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it allows us to correct mistakes in data and to estimate missing data. It is undesirable because many statistical tasks become ill-conditioned.
The alternative attribute sets approach (AASA) can determine several sets of relevant attributes that can be used separately to solve the original problem. We tested AASA on several classification problems. We demonstrated that this methodology can be more accurate than the best traditional feature selection methods.
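For readers unfamiliar with the Kaiser rule mentioned in the abstract, the following is a minimal sketch of the classic rule only: keep the principal components of the correlation matrix whose eigenvalue exceeds 1, which is the mean eigenvalue of a correlation matrix. It is not the thesis's ‘double Kaiser selection’, whose details are not given in this record. The function name `kaiser_components` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def kaiser_components(X):
    """Classic Kaiser rule (illustrative sketch, not the thesis's method).

    X is an (n_samples, n_attributes) array. Returns the eigenvalues and
    eigenvectors of the attribute correlation matrix whose eigenvalue is
    greater than 1, sorted by decreasing eigenvalue.
    """
    # Correlation matrix between attributes (columns of X).
    corr = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Reorder so the largest eigenvalue comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Kaiser rule: keep components with eigenvalue above the mean, which is 1.
    keep = eigvals > 1.0
    return eigvals[keep], eigvecs[:, keep]


# Hypothetical synthetic data: six attributes driven by two independent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
noise = 0.1 * rng.normal(size=(500, 6))
X_demo = np.repeat(factors, 3, axis=1) + noise  # columns 0-2 and 3-5 form two groups

kept_vals, kept_vecs = kaiser_components(X_demo)
print("components kept:", len(kept_vals))
```

Using the correlation matrix rather than the covariance matrix is what makes the threshold of 1 meaningful, since each standardised attribute then contributes exactly one unit of variance.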
Supervisor: Gorban, Alexander ; Mirkes, Evgeny
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.755332
DOI: Not available