Title: The development of a framework for semantic similarity measures for the Arabic language
Author: Al-Marsoomi, Faaza Abduljabar
ISNI: 0000 0004 7658 9536
Awarding Body: Manchester Metropolitan University
Current Institution: Manchester Metropolitan University
Date of Award: 2015
This thesis presents a novel framework for developing an Arabic Short Text Semantic Similarity (STSS) measure named NasTa. STSS measures are developed for short texts of 10 to 25 words. The algorithm calculates the STSS based on Part of Speech (POS), Arabic Word Sense Disambiguation (WSD), semantic nets and corpus statistics. The proposed framework is founded on word similarity measures. Firstly, a novel Arabic noun similarity measure is created using information sources extracted from a lexical database known as Arabic WordNet. Secondly, a novel verb similarity algorithm is created based on the assumption that words sharing a common root usually have a related meaning, which is a central characteristic of the Arabic language. Two Arabic word benchmark datasets, one for nouns and one for verbs, are created to evaluate these measures. They are the first of their kind for Arabic. Their creation methodologies use the best available experimental techniques to create materials and collect human ratings from representative samples of the Arabic-speaking population. Experimental evaluation indicates that the Arabic noun and Arabic verb measures performed well and achieved good correlations in comparison with the average human performance on the noun and verb benchmark datasets respectively. Specific features of the Arabic language are addressed. A new Arabic WSD algorithm is created to address the challenge of ambiguity caused by missing diacritics in the contemporary Arabic writing system. The algorithm disambiguates all words (nouns and verbs) in the Arabic short texts without requiring any manual training data. Moreover, a novel algorithm is presented to identify the similarity score between two words belonging to different POS, that is, a noun paired with a verb or a verb paired with a noun. This algorithm is developed to perform Arabic WSD based on the concept of noun semantic similarity. Two benchmark datasets for text similarity are presented: ASTSS-68 and ASTSS-21.
Experimental results indicate that the Arabic STSS algorithm achieved a good, statistically significant correlation in comparison with the average human performance on ASTSS-68.
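As an illustration only, the general idea of building a short-text score from word-level scores, and of comparing machine scores with human ratings, can be sketched as follows. This is not the NasTa algorithm: the toy lexicon, the transliterated words, the 0.5 shared-root score, the best-match averaging scheme and all function names are assumptions introduced for this sketch, and the record does not state which correlation coefficient the thesis reports.

```python
from math import sqrt
from typing import Dict, List, Optional

# Hypothetical toy lexicon (assumption, not from the thesis):
# transliterated word -> consonantal root.
ROOTS: Dict[str, str] = {
    "kataba": "ktb",   # verb, "he wrote"
    "kitab": "ktb",    # noun, "book"
    "maktaba": "ktb",  # noun, "library"
    "darasa": "drs",   # verb, "he studied"
    "madrasa": "drs",  # noun, "school"
}


def word_similarity(a: str, b: str) -> float:
    """Toy word score: 1.0 if identical, 0.5 if the roots match, else 0.0.

    The 0.5 value is an arbitrary placeholder standing in for a real
    root-based or WordNet-based measure.
    """
    if a == b:
        return 1.0
    root_a: Optional[str] = ROOTS.get(a)
    root_b: Optional[str] = ROOTS.get(b)
    if root_a is not None and root_a == root_b:
        return 0.5
    return 0.0


def text_similarity(t1: List[str], t2: List[str]) -> float:
    """Symmetric best-match average of word scores between two token lists."""
    if not t1 or not t2:
        return 0.0

    def directed(src: List[str], dst: List[str]) -> float:
        # For each source word, keep its best score against the other text.
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)

    return (directed(t1, t2) + directed(t2, t1)) / 2.0


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

In such a setup, `text_similarity` would be applied to each pair of short texts in a benchmark, and `pearson` would compare the resulting scores with the averaged human ratings for the same pairs.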
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available