Title: Semantic based indexing technique for optimisation and intelligent document representation : application to structured and unstructured document clustering
Author: Barresi, Simona
ISNI: 0000 0004 2697 7047
Awarding Body: University of Salford
Current Institution: University of Salford
Date of Award: 2010
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Thesis embargoed until 31 Jul 2022
Advances in data collection and the growing volume of unstructured and unlabelled text documents have created a need for better disambiguation and indexing techniques that allow large document collections to be organised effectively and intelligently into a small number of significant clusters, facilitating analysis, browsing, and searching. Traditionally, document clustering systems have relied on bag-of-words and term frequency approaches to represent and subsequently classify documents, taking into account only document syntax and giving no consideration to semantic aspects. To address this issue, more complex indexing and clustering techniques need to be investigated further: techniques that consider the semantic associations between the words contained in a document and differentiate the degree of semantic importance of terms during the classification process, so as to enable appropriate and automatic contextualisation of text documents and information. This research proposes a new indexing technique that can be used to effectively represent, and subsequently cluster, collections of unstructured or structured documents. The technique aims to overcome some of the major problems of the bag-of-words approach, such as its lack of consideration for synonyms and its usual failure to differentiate the degree of semantic importance of terms. The main idea is to map each document into a lower-dimensional space by considering the semantic associations between the words it contains. To address the semantic problems posed by traditional indexing, the method focuses on word sense disambiguation and document concepts. It extracts concepts from documents and uses a set of these concepts as indexing units, achieving vector dimensionality reduction as well as more cohesive and better-separated clusters. Good results are also achieved in terms of purity and entropy, and in comparison with similar studies in the field of semantic-based concept indexing.
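As a rough illustration of the ideas named in the abstract, the following Python sketch contrasts a bag-of-words vector with a concept-based one and computes purity and entropy for a toy clustering. It is not the thesis's method: the `CONCEPT_OF` dictionary, the function names, and the sample data are all hypothetical, and the purity and entropy formulas are the commonly used textbook definitions, which may differ from those adopted in the thesis.

```python
from collections import Counter
from math import log2

# Hypothetical term-to-concept map (an assumption for illustration only).
# A real system would derive it through word sense disambiguation.
CONCEPT_OF = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medic", "physician": "medic", "nurse": "medic",
}

def bag_of_words(text):
    """Count raw terms, so synonyms occupy separate dimensions."""
    return Counter(text.lower().split())

def concept_index(text):
    """Count concepts instead of terms; unmapped terms are dropped."""
    tokens = text.lower().split()
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)

def purity(clusters):
    """Fraction of documents carrying the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total

def entropy(clusters):
    """Size-weighted average label entropy (in bits) over all clusters."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        n = len(c)
        h = -sum((k / n) * log2(k / n) for k in Counter(c).values())
        result += (n / total) * h
    return result

doc = "the doctor and the physician drove a car not an automobile"
bow = bag_of_words(doc)        # one dimension per distinct term
concepts = concept_index(doc)  # one dimension per distinct concept

# Each cluster is represented by the true class labels of its documents.
clusters = [["vehicle", "vehicle", "medic"], ["medic", "medic", "medic"]]

print(len(bow), len(concepts))
print(purity(clusters), entropy(clusters))
```

In this toy example the synonyms "doctor" and "physician" collapse into a single concept dimension, which is the kind of dimensionality reduction the abstract describes. Higher purity and lower entropy indicate clusters that agree more closely with the reference classes.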
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available