Title: Comprehensive review of classification algorithms for high dimensional datasets
Author: Syarif, Iwan
ISNI:       0000 0004 5346 9371
Awarding Body: University of Southampton
Current Institution: University of Southampton
Date of Award: 2014
Availability of Full Text:
Access from EThOS: Full text unavailable from EThOS.
Access from Institution:
Machine learning algorithms have been widely used to solve various kinds of data classification problems. Classification problems, especially on high-dimensional datasets, have attracted many researchers seeking efficient approaches to address them. However, classification becomes very complicated and computationally expensive when the number of possible combinations of variables is high. In this research, we evaluate the performance of four basic classifiers (naïve Bayes, k-nearest neighbour, decision tree and rule induction), ensemble classifiers (bagging and boosting) and the Support Vector Machine (SVM). We also investigate two widely used feature selection algorithms: the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Our experiments show that feature selection algorithms, especially GA and PSO, significantly reduce the number of features needed as well as greatly reducing the computational cost. Furthermore, these algorithms do not severely reduce classification accuracy, and in some cases they even improve it. PSO reduced the nine datasets to 12.78% of their original attributes on average, while GA reduced them only to 30.52% on average. In terms of classification performance, GA is better than PSO: the datasets reduced by GA achieve better classification performance than the originals on 5 of the 9 datasets, while the datasets reduced by PSO improve on only 3 of 9. The total running time of the four basic classifiers (NB, kNN, DT and RI) on the 9 original datasets is 68,169 seconds, compared with 3,799 seconds on the GA-reduced datasets and only 326 seconds on the PSO-reduced datasets (more than 209 times faster). We also applied ensemble classifiers such as bagging and boosting as a comparison. Our experiments show that bagging and boosting do not give a significant improvement.
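The wrapper-style feature selection described above can be sketched in a few lines. The following is a minimal illustration, not the thesis's actual experimental code: it assumes scikit-learn and NumPy, uses a toy synthetic dataset in place of the nine benchmark datasets, and implements a simplified GA (binary feature masks, truncation selection, one-point crossover, bit-flip mutation) with cross-validated kNN accuracy as the fitness function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy stand-in for a high-dimensional dataset: 30 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

def fitness(mask):
    """Cross-validated kNN accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(pop_size=20, generations=15, mutation_rate=0.05):
    # Each chromosome is a binary mask over the features.
    pop = rng.integers(0, 2, size=(pop_size, X.shape[1]))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Keep the top half (truncation selection) as parents.
        parents = pop[np.argsort(scores)[-pop_size // 2:]]
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, X.shape[1])             # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(X.shape[1]) < mutation_rate  # bit-flip mutation
            child = np.where(flip, 1 - child, child)
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], scores.max()

mask, acc = ga_select()
print(f"selected {mask.sum()}/{X.shape[1]} features, CV accuracy {acc:.3f}")
```

A PSO variant differs only in how candidate masks are updated between generations (velocity updates thresholded to bits rather than crossover and mutation); the fitness function and the wrapper structure stay the same.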
The average improvement of bagging across the nine datasets is only 0.85%, while boosting's average improvement is 1.14%. The ensemble classifiers (both bagging and boosting) outperform a single classifier on 6 of the 9 datasets. SVM has been shown to perform much better on high-dimensional datasets with numerical features. Although SVM works well with default parameter values, its performance can be improved significantly by parameter optimization. Our experiments show that SVM parameter optimization using grid search always finds a near-optimal parameter combination within the given ranges; it is very powerful and can improve accuracy significantly. Unfortunately, grid search is very slow, so it is practical only for low-dimensional datasets with few parameters. SVM parameter optimization using an Evolutionary Algorithm (EA) can be used to overcome this limitation, and EA has proven more stable than grid search. Based on average running time, EA is almost 16 times faster than grid search (294 seconds compared to 4,680 seconds). Overall, SVM with parameter optimization outperforms the other algorithms on 5 of the 9 datasets. However, SVM does not perform well on datasets with non-numerical attributes.
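SVM grid search of the kind discussed above can be illustrated with scikit-learn's GridSearchCV. This is a hedged sketch, not the thesis's setup: the dataset is synthetic and the (C, gamma) grid values are illustrative assumptions. It does show why grid search scales badly: the number of model fits is the product of the grid sizes times the number of cross-validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy numerical dataset; SVM with an RBF kernel suits this kind of data.
X, y = make_classification(n_samples=300, n_features=20, random_state=1)

# Exhaustive grid over (C, gamma) on a logarithmic scale. Cost grows
# multiplicatively with each extra parameter value, which is why grid
# search is only practical for a handful of parameters.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)  # 4 * 4 grid points * 3 folds = 48 model fits
print(search.best_params_, round(search.best_score_, 3))
```

An evolutionary alternative, as in the thesis, replaces the exhaustive sweep with a population of (C, gamma) candidates that are mutated and recombined, evaluating far fewer parameter combinations to reach a comparable optimum.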
Supervisor: Prugel-Bennett, Adam Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: QA75 Electronic computers. Computer science