Title: Named entity recognition : challenges in document annotation, gazetteer construction and disambiguation
Author: Zhang, Ziqi
ISNI: 0000 0004 2736 6897
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2013
The 'information explosion' has generated an unprecedented amount of published information that is still growing at an astonishing rate. As the amount of information grows, managing it becomes increasingly challenging. A key to this challenge lies in the technology of Information Extraction, which automatically transforms unstructured textual data into a structured representation that can be interpreted and manipulated by machines. It is recognised that a fundamental task in Information Extraction is Named Entity Recognition (NER), the goals of which are to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. Further, due to the polysemous nature of natural language, name references are often ambiguous. Resolving ambiguity concerns recognising the true referent entity of a name reference, which is essentially a further named entity 'recognition' step and often a compulsory process required by tasks built on top of NER. This research presents a body of work aimed at addressing three research questions for NER. The first question concerns effective and efficient methods for training data annotation, which is the task of creating essential training examples for machine learning based NER methods. The second question studies the automatic generation of background knowledge for NER in the form of gazetteers, which are often critical resources for improving the performance of NER methods. The third question addresses resolving ambiguous name references, a further 'recognition' step that ensures that the output of NER is usable by many complex tasks and applications. For each research question, the related literature has been carefully studied and its limitations have been identified and discussed.
New hypotheses and methods have been proposed, leading to a number of contributions:
- an approach to training data annotation for supervised NER methods, based on the study of annotator suitability and suitability-based task allocation;
- a method of automatically expanding existing gazetteers of pre-defined semantic categories by exploiting the structure and knowledge of Wikipedia;
- a method of automatically generating untyped gazetteers for NER based on the 'topic-representativeness' of words in documents;
- a method of named entity disambiguation based on maximising the semantic relatedness between candidate entities in a text discourse;
- a review of lexical semantic relatedness measures, and a new lexical semantic relatedness measure that harnesses knowledge from different resources.
The proposed methods have been evaluated by carefully designed experiments, following the standard practice in each related research area. The results have confirmed the validity of the corresponding hypotheses, as well as the empirical effectiveness of these methods. Overall, it is believed that this research has made a solid contribution to research on NER and related areas.
Supervisor: Ciravegna, Fabio
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available