Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.553495
Title: Arabic named entity recognition : a corpus-based study
Author: Algahtani, Shabib Mallouh
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2012
Abstract:
The task of finding and classifying proper nouns in natural language text is the core of most Named Entity Recognition (NER) systems. The NER problem has received much attention, as NER forms the basic building block of any Information Extraction system. Although finding and classifying proper nouns in text is a very challenging task in English, the task benefits a great deal from the distinguishing orthographic feature of capitalization. When this feature is uninformative, as in all-uppercase text or at the start of a sentence, ambiguity increases and more knowledge sources are needed to resolve it. The lack of capitalization is, however, an intrinsic feature of Arabic, so the NER task in Arabic is immediately harder than in English. This ambiguity is compounded because most Arabic proper nouns are indistinguishable in form from common nouns and adjectives. Thus, a lookup approach relying on proper noun dictionaries would not be an appropriate way to tackle the problem, as ambiguous tokens that fall in this category are more likely to be used as non-proper nouns in text. In addition, Arabic is morphologically rich, which poses further challenges for the NER task. We hypothesize that Arabic NER is very closely bound to Part-of-Speech (POS) tagging. However, Arabic POS taggers typically have their worst accuracy on proper noun tagging, especially person names, given the problem just mentioned. Thus, we first built a POS tagging tool with good coverage using a corpus-based approach. Then, we used a filtering technique to help collect unique proper nouns from large gazetteers. Finally, we combined the POS, gazetteer, and unique names list features with a further set of features that we defined, and used them to build a corpus-based NER classifier from labelled data.
Experiments on different datasets, against a baseline and with different combinations of features, demonstrated the efficiency of our final set of proposed features. The unique names list moreover helped reduce the noise that the POS feature introduces, on proper nouns in particular. Evaluation of our approach shows that it performs comparably with systems that use more, and more sophisticated, knowledge sources, and hence is easier to deploy for practical use.
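The abstract does not say how the gazetteer filtering or the feature combination was implemented. The following is only a minimal sketch of the general idea, assuming that a "unique" proper noun is a gazetteer entry that does not also occur in a lexicon of common nouns and adjectives. The function names, the feature names, and the tiny word lists are all hypothetical and are not taken from the thesis.

```python
def filter_unique_names(gazetteer, common_lexicon):
    """Keep gazetteer entries that do not also occur as common words.

    Assumption: 'unique' means absent from a common noun/adjective lexicon.
    Input order is preserved and duplicates are dropped.
    """
    common = set(common_lexicon)
    return [name for name in dict.fromkeys(gazetteer) if name not in common]


def token_features(token, pos_tag, gazetteer, unique_names):
    """Build a hypothetical per-token feature dictionary for an NER classifier."""
    return {
        "token": token,
        "pos": pos_tag,                              # tag from a POS tagger
        "in_gazetteer": token in gazetteer,          # ambiguous lookup feature
        "in_unique_names": token in unique_names,    # unambiguous lookup feature
    }


# Illustrative data only: the first two names are also ordinary adjectives
# ("generous", "happy"), so a plain gazetteer lookup on them is ambiguous.
gazetteer = ["كريم", "سعيد", "إبراهيم", "إسماعيل"]
common_lexicon = ["كريم", "سعيد", "كتاب"]

unique_names = filter_unique_names(gazetteer, common_lexicon)
features = token_features("كريم", "ADJ", set(gazetteer), set(unique_names))
```

In this sketch a token such as "كريم" fires the gazetteer feature but not the unique names feature, which is the kind of signal a classifier could use to discount an unreliable POS tag or dictionary match.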
Supervisor: Mcnaught, John
Sponsor: Saudi Government
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.553495
DOI: Not available