Artificial immune systems for Web content mining: focusing on the discovery of interesting information
This thesis explores how biological metaphors can be applied to web content mining and, more specifically, to the identification of interesting information in web documents. Web content mining is the use of content found on the web, most usually the text of web pages, for data mining tasks such as classification. Because web content is noisy and undergoes constant change, an adaptive system is required. The discovery of interesting information is an advance on basic text mining in that it aims to identify text that is novel, unexpected or surprising to a user, whilst still being relevant. This thesis investigates the use of Artificial Immune Systems (AIS) for the discovery of interesting information, as AIS are thought to confer the adaptability and learning required for this task. Two novel Artificial Immune Systems are described and tested. AISEC (Artificial Immune System for Interesting E-mail Classification) is a novel, immune-inspired system for the classification of e-mail. It is shown that AISEC performs with a predictive accuracy comparable to a naïve Bayesian algorithm when continually classifying e-mail collected from a real user. This part of the work contributes to the understanding of how AIS react in a continuous learning scenario. Building on the knowledge gained by testing AISEC, AISIID (Artificial Immune System for Interesting Information Discovery) is then described. A study involving the subjective evaluation of the results by users is undertaken, and AISIID is seen to discover pages rated more interesting by users than those found by a comparative system. The results of this study also reveal that AISIID performs with subjective quality similar to the well-known search engine Google. This contributes to a better understanding of the user's perception of interestingness and highlights possible inadequacies in the current understanding of interestingness in text documents.