Classification rule induction on large datasets is a major challenge in the field of
data mining, in a world where massive amounts of data are recorded. There are two
main approaches to classification rule induction: the 'divide and conquer' approach
and the 'separate and conquer' approach. Although both approaches deliver comparable
classification accuracy, they differ in rule representation and, in certain
circumstances, in the quality of the rules induced. The 'divide and conquer' approach
represents classification rules in the intuitive form of a tree, which is easy for humans to assimilate.
However, modular rules induced by the 'separate and conquer' approach generally
perform better in environments where the training data of the classifier is
noisy or contains clashes. The term 'modular rules' refers to any set of rules
describing some domain of interest; such rules will generally not fit together
naturally in a decision tree. Both approaches are challenged by increasingly large
volumes of data. There have been several attempts to scale up the 'divide and
conquer' approach; however, there is very little work on scaling up the 'separate
and conquer' approach. One general approach is to use supercomputers with
faster hardware to process these huge amounts of data, yet modest-sized organisations
may not be able to afford such hardware. However, most organisations have local
workstations that they use for everyday applications such as word processing or spreadsheets. These workstations are usually connected
in a local network, are mainly used during normal working hours, and tend to be
idle overnight and at weekends. During these idle times, the networked workstations
could be used for data mining applications on large datasets.
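
As a rough illustration of the 'separate and conquer' approach discussed above, the
following sketch induces one modular rule at a time and then removes the examples
that rule covers before inducing the next. It is written in Python with simplified
data handling and illustrative names; it follows the general spirit of Prism-style
rule induction rather than the exact algorithm parallelised in this work.

    # Minimal sketch of 'separate and conquer' rule induction in the spirit of
    # Prism. Examples are plain dictionaries; attribute handling, tie-breaking
    # and pruning are deliberately simplified for illustration.
    def induce_rules_for_class(examples, attributes, target_class):
        rules = []
        remaining = list(examples)          # examples not yet covered by any rule
        while any(e['class'] == target_class for e in remaining):
            rule = []                       # conjunction of (attribute, value) terms
            covered = list(remaining)
            unused = set(attributes)
            # 'Conquer': grow the rule term by term until only the target class is covered
            while unused and any(e['class'] != target_class for e in covered):
                best_term, best_prob = None, -1.0
                for attr in unused:
                    for value in {e[attr] for e in covered}:
                        subset = [e for e in covered if e[attr] == value]
                        prob = sum(e['class'] == target_class for e in subset) / len(subset)
                        if prob > best_prob:
                            best_term, best_prob = (attr, value), prob
                attr, value = best_term
                rule.append(best_term)
                unused.discard(attr)
                covered = [e for e in covered if e[attr] == value]
            rules.append((rule, target_class))
            # 'Separate': remove the examples covered by the newly induced rule
            remaining = [e for e in remaining if e not in covered]
        return rules

Because each rule is induced and stored independently of the others, the resulting
rule set is modular and does not have to fit into a single decision tree.
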
This research focuses on a low-cost solution for modest-sized organisations that
cannot afford fast supercomputers. For this reason, this work aims to utilise the
computational power and memory of a network of workstations. In this research, a
novel framework for scaling up modular classification rule induction is presented,
based on a distributed blackboard architecture. The framework is called PMCRI
(Parallel Modular Classification Rule Inducer). It provides an underlying communication
infrastructure for parallelising a whole family of modular classification
rule induction algorithms: the Prism family. Experimental results show good
scale-up behaviour on various datasets and thus confirm the success of PMCRI.
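
Since only the distributed blackboard architecture underlying PMCRI is named here,
the toy sketch below illustrates the general blackboard pattern rather than PMCRI's
actual communication infrastructure: several learner workers, each responsible for a
hypothetical partition of the candidate rule terms, post their locally best terms to
a shared blackboard, and a moderator selects the globally best one. All names, scores,
and the use of threads on a single machine are illustrative assumptions.

    # Toy, single-machine analogue of a blackboard architecture (not PMCRI itself).
    import queue
    import threading

    blackboard = queue.Queue()          # shared space for partial results

    def learner(name, candidate_terms):
        # Each learner evaluates only its own subset of candidate rule terms
        # and posts its locally best (term, quality) pair to the blackboard.
        best = max(candidate_terms, key=lambda t: t[1])
        blackboard.put((name, best))

    # Hypothetical candidate rule terms with made-up quality scores
    candidates = {
        'learner_1': [('outlook = sunny', 0.61), ('humidity = high', 0.70)],
        'learner_2': [('windy = false', 0.55), ('temperature = mild', 0.66)],
    }
    workers = [threading.Thread(target=learner, args=(n, c))
               for n, c in candidates.items()]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # Moderator: read all postings and keep the globally best term
    postings = [blackboard.get() for _ in candidates]
    print('globally best term:', max(postings, key=lambda p: p[1][1]))

In a distributed setting, the blackboard would reside on a server reachable over the
local network, with the learners running on the otherwise idle workstations described
above.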