Title: Automatic term extraction and text categorisation
Author: Manomaisupat, Pensiri
Awarding Body: University of Surrey
Current Institution: University of Surrey
Date of Award: 2006
Automatic text categorisation is a major challenge for information retrieval, information extraction, and semantic web projects. The categorisation of texts depends on the 'meaning' of the individual texts: texts sharing the same meaning should be categorised together, and those with different meanings should be categorised separately. This is an intelligent task that requires knowledge of a given domain and expertise in text categorisation. Meaning in a specialist domain is expressed through keywords; these keywords change over time as new keywords are added and old ones removed. I present a method in which keywords are extracted automatically from a large collection of texts and then used to train neural computing systems; in a limited way, the systems 'learn' to categorise. A number of different techniques for extracting keywords are presented. The keywords were extracted using the traditional tf-idf metric and weirdness, a technique used in contrastive corpus linguistics; multi-word compounds have also been used to create vectors for text collections. The 'learning' algorithms used were unsupervised self-organising maps and supervised support vector machines; both were used for dimension reduction, that is, mapping from a high-dimensional feature space to a lower-dimensional one without much loss of accuracy. The performance of such systems is evaluated through classification accuracy and average quantisation error. Three large text collections were used for training and testing: the TREC-AP newswire, the Reuters RCV1, and streaming news from Reuters Financial. The focus of the experimentation was on financial news. An archetype was developed that incorporates text analysis, terminology management, neural computing and feature vector generation systems. A novel evaluation scheme is reported in which a vector of randomly selected words from a text collection is used as a baseline.
The other comparisons are between systems trained by different techniques and with different learning algorithms. The key results include the finding that classification accuracy is highest when automatically extracted compound terms are chosen for creating vectors; however, when a terminology-based method was used to create vectors, the single words from this method appear to be a better vector for training. The results of the experiments are encouraging. With further research, improvements in quantitative performance can be expected.
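As an illustration of the weirdness measure named in the abstract, the following minimal sketch scores each word by the ratio of its relative frequency in a specialist corpus to its relative frequency in a general corpus, then builds a keyword vector for a text. It is not taken from the thesis: the function names, the toy corpora, the add-one smoothing, and the binary keyword-presence vector are all assumptions made for the example.

```python
from collections import Counter


def weirdness(specialist_tokens, general_tokens):
    """Score each specialist-corpus word by the ratio of its relative
    frequency in the specialist corpus to that in the general corpus."""
    spec = Counter(specialist_tokens)
    gen = Counter(general_tokens)
    n_spec = len(specialist_tokens)
    n_gen = len(general_tokens)
    scores = {}
    for word, f_spec in spec.items():
        # Assumption: add-one smoothing avoids division by zero for words
        # that never occur in the general corpus.
        f_gen = gen[word] + 1
        scores[word] = (f_spec / n_spec) / (f_gen / n_gen)
    return scores


def top_keywords(scores, k):
    """Return the k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def binary_vector(tokens, keywords):
    """Assumed representation: 1 if the keyword occurs in the text, else 0."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]


# Toy corpora, invented for the example.
specialist = "the bond yield rose as the bond market rallied".split()
general = "the cat sat on the mat as the dog ran to the market".split()

scores = weirdness(specialist, general)
keywords = top_keywords(scores, 2)
vector = binary_vector("bond prices fell".split(), keywords)
```

In this toy setting, domain words such as "bond" score well above function words such as "the", which is the property that makes the ratio usable for keyword selection. Vectors of this kind could then be passed to a self-organising map or a support vector machine.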
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available