Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.601723
Title: New techniques for Arabic document classification
Author: Chantar, Hamouda Khalifa Hamouda
Awarding Body: Heriot-Watt University
Current Institution: Heriot-Watt University
Date of Award: 2013
Availability of Full Text:
Access from EThOS:
Access from Institution:
Abstract:
Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in the domain of organizing, for browsing in large document collections. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve the computational e ciency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain su cient information that help with the classification task. This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classi cation tasks. Comparison between feature selection approaches is done on the basis of using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB), to classify a hold out test set. Using publically available Arabic datasets, results show that BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the di erences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand. The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classi cation. In comparison with Bag of Words text representation, results show that using phrases alone as terms in Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly while combining bag of words and phrase based representations may increase the classification accuracy of the SVM classifier slightly.
Supervisor: Corne, David Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.601723  DOI: Not available
Share: