Title: Improved rule-based document representation and classification using genetic programming
Author: Soltan-Zadeh, Yasaman
ISNI: 0000 0004 2717 8036
Awarding Body: Royal Holloway, University of London
Current Institution: Royal Holloway, University of London
Date of Award: 2011
In the field of information retrieval, and in particular in classification, mathematical and statistical rules and classifiers are not human readable. Rules and classifiers that are not human readable act as a barrier to using "expert knowledge" to improve results. Such barriers can be overcome using genetic programming. The aim of this thesis is to use genetic programming to produce classifiers, and in particular document representatives, that are human readable. Human readability makes these representatives more interactive and adaptable by making it possible to integrate expert knowledge. Genetic programming, as a non-deterministic method with high flexibility, is among the best options for producing human readable document representatives. To test the results of the chosen method, standard test collections are used. These standard test collections guarantee that the experiments are replicable and the results are reproducible by other researchers. This thesis demonstrates the process of producing human readable document representatives that are transparent enough for further modification and analysis using expert knowledge, while retaining performance. To obtain these findings, this thesis has contributed to the field by developing a system that introduces a novel tree structure to improve the feature selection process, and a novel fitness function to improve the quality of the representative generator. To produce a human readable representative, the tree structure is changed into a new shape with more control over the number of children. This reduces the depth of each tree for a given number of features and results in a flatter structure. A fitness function is constructed by combining classification accuracy on the training and validation sets with a parsimony component. This study found that the order in which documents are matched with representatives can improve overall performance.
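The fitness function described above can be illustrated with a minimal sketch. The abstract does not say how the three components are combined, so the weighted sum below, the function name `fitness`, and the parameters `w_train`, `w_val` and `parsimony` are all assumptions made for illustration, not the thesis's actual formulation.

```python
def fitness(train_acc, val_acc, n_nodes, w_train=0.5, w_val=0.5, parsimony=0.001):
    """Hypothetical fitness: weighted accuracy minus a size penalty.

    train_acc, val_acc: classification accuracy in [0, 1] on the
        training and validation sets.
    n_nodes: number of nodes in the candidate tree (the parsimony term).
    The weights and the linear form are assumptions, not taken from the thesis.
    """
    if not (0.0 <= train_acc <= 1.0 and 0.0 <= val_acc <= 1.0):
        raise ValueError("accuracies must lie in [0, 1]")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    accuracy_term = w_train * train_acc + w_val * val_acc
    size_penalty = parsimony * n_nodes  # larger trees score lower
    return accuracy_term - size_penalty


if __name__ == "__main__":
    # Two candidates with equal accuracy: the smaller tree is preferred.
    print(fitness(0.9, 0.8, n_nodes=10))
    print(fitness(0.9, 0.8, n_nodes=50))
```

Under these assumptions, two representatives with the same accuracy are ranked by size, which is one common way a parsimony component can favour shorter, more readable trees.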
Different feature selection methods are investigated and integrated into our genetic programming based feature selection method, which is based on a probability distribution derived from the feature weights.
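One way to read "a probability distribution derived from the feature weights" is weight-proportional sampling, sketched below. This is an assumption about the general idea only: the function name `sample_features`, the example terms and their weights are hypothetical, and the thesis may derive or use the distribution differently.

```python
import random


def sample_features(weights, k, rng=None):
    """Draw up to k distinct features, each pick proportional to its weight.

    weights: dict mapping feature name -> strictly positive weight
        (for example a score from some feature selection measure).
    rng: optional random.Random instance, for reproducible draws.
    Sampling is without replacement; this choice is an assumption.
    """
    rng = rng or random.Random()
    pool = dict(weights)  # copy so the caller's dict is not modified
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # remove so the same feature is not drawn twice
    return chosen


if __name__ == "__main__":
    # Hypothetical term weights for one document category.
    term_weights = {"wheat": 3.0, "grain": 2.0, "export": 1.0, "the": 0.1}
    print(sample_features(term_weights, k=2, rng=random.Random(0)))
```

In a genetic programming setting, a routine like this could supply the terminal features used when building or mutating trees, so that highly weighted features appear more often without excluding the others.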
Supervisor: Saeedi, Masoud ; Jashapara, Ashok
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: information retrieval ; machine learning ; genetic programming ; classification