Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700584
Title: An ensemble self-structuring neural network approach to solving classification problems with virtual concept drift and its application to phishing websites
Author: Mohammad, Rami Mustafa A.
ISNI:       0000 0004 5993 9450
Awarding Body: University of Huddersfield
Current Institution: University of Huddersfield
Date of Award: 2016
Availability of Full Text:
Access through EThOS:
Access through Institution:
Abstract:
Classification in data mining is one of the well-known tasks that aim to construct a classification model from a labelled input data set. Most classification models are devoted to a static environment where the complete training data set is presented to the classification algorithm. This data set is assumed to cover all information needed to learn the pertinent concepts (rules and patterns) related to how to classify unseen examples to predefined classes. However, in dynamic (non-stationary) domains, the set of features (input data attributes) may change over time. For instance, some features that are considered significant at time Ti might become useless or irrelevant at time Ti+j. This situation results in a phenomena called Virtual Concept Drift. Yet, the set of features that are dropped at time Ti+j might return to become significant again in the future. Such a situation results in the so-called Cyclical Concept Drift, which is a direct result of the frequently called catastrophic forgetting dilemma. Catastrophic forgetting happens when the learning of new knowledge completely removes the previously learned knowledge. Phishing is a dynamic classification problem where a virtual concept drift might occur. Yet, the virtual concept drift that occurs in phishing might be guided by some malevolent intelligent agent rather than occurring naturally. One reason why phishers keep changing the features combination when creating phishing websites might be that they have the ability to interpret the anti-phishing tool and thus they pick a new set of features that can circumvent it. However, besides the generalisation capability, fault tolerance, and strong ability to learn, a Neural Network (NN) classification model is considered as a black box. Hence, if someone has the skills to hack into the NN based classification model, he might face difficulties to interpret and understand how the NN processes the input data in order to produce the final decision (assign class value). In this thesis, we investigate the problem of virtual concept drift by proposing a framework that can keep pace with the continuous changes in the input features. The proposed framework has been applied to phishing websites classification problem and it shows competitive results with respect to various evaluation measures (Harmonic Mean (F1-score), precision, accuracy, etc.) when compared to several other data mining techniques. The framework creates an ensemble of classifiers (group of classifiers) and it offers a balance between stability (maintaining previously learned knowledge) and plasticity (learning knowledge from the newly offered training data set). Hence, the framework can also handle the cyclical concept drift. The classifiers that constitute the ensemble are created using an improved Self-Structuring Neural Networks algorithm (SSNN). Traditionally, NN modelling techniques rely on trial and error, which is a tedious and time-consuming process. The SSNN simplifies structuring NN classifiers with minimum intervention from the user. The framework evaluates the ensemble whenever a new data set chunk is collected. If the overall accuracy of the combined results from the ensemble drops significantly, a new classifier is created using the SSNN and added to the ensemble. Overall, the experimental results show that the proposed framework affords a balance between stability and plasticity and can effectively handle the virtual concept drift when applied to phishing websites classification problem. Most of the chapters of this thesis have been subject to publication.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.700584  DOI: Not available
Keywords: Q Science (General) ; QA75 Electronic computers. Computer science
Share: