Title: Active expert sourcing : knowledge extraction from domain specific information
Author: Alghamdi, Ans
Awarding Body: University of Essex
Current Institution: University of Essex
Date of Award: 2019
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Thesis embargoed until 05 Sep 2024
The development of Named Entity Recognition (NER) in recent years is partially attributed to the availability of annotated data-sets. Data-sets play a crucial part in developing, training, and testing NER algorithms. The need for data-sets becomes more important when adapting the algorithms to new domains. However, domain-specific information poses distinct challenges for NER, such as the need to annotate a different set of Named Entity (NE) types (i.e. a different NE schema) or, more importantly, the need for domain expert annotators. Many domain-specific NER systems use academic paper-sharing platforms as sources for data-sets. Either abstracts or the full texts of publications are extracted from the platforms to construct raw data-sets. These raw data-sets are then annotated by domain experts. However, expert annotation is an expensive process and consumes more resources than non-expert annotation. This thesis tackles the problem of adapting NER to new domains and focuses on reducing the resources needed to create domain-specific NER. In this thesis, academic paper-sharing portals are used as a source of raw data and also as a source of annotators. In other words, paper-sharing platforms are used as a crowdsourcing platform, and the scholars who share their publications are asked to annotate their own work. This thesis also uses active learning (AL) to further reduce the resources needed to develop NER. In the introduced approach, experts submit their papers online. The papers then go through a Natural Language Processing (NLP) pipeline that prepares the papers' text for annotation. An active learning algorithm, as part of this pipeline, selects the most informative instances to be annotated. The author is then asked to annotate these instances. The approach runs as a continuous loop that produces more annotated resources and improves the NER model.
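The instance-selection step described above can be illustrated with a minimal sketch. The abstract does not state which query strategy the thesis uses, so this example assumes entropy-based uncertainty sampling, one common active learning strategy. The names `select_most_informative`, `toy_predict`, and the `ENTITY` label are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

TokenDist = Dict[str, float]   # label -> probability for one token
Sentence = List[str]


def token_entropy(dist: TokenDist) -> float:
    """Shannon entropy of one token's label distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def sentence_uncertainty(dists: Sequence[TokenDist]) -> float:
    """Mean token entropy; higher means the model is less certain."""
    if not dists:
        return 0.0
    return sum(token_entropy(d) for d in dists) / len(dists)


def select_most_informative(
    pool: List[Sentence],
    predict: Callable[[Sentence], List[TokenDist]],
    k: int,
) -> List[Sentence]:
    """Return the k sentences the model is least certain about."""
    scored = [(sentence_uncertainty(predict(s)), i) for i, s in enumerate(pool)]
    # Sort by descending uncertainty; break ties by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pool[i] for _, i in scored[:k]]


def toy_predict(sentence: Sentence) -> List[TokenDist]:
    """Hypothetical stand-in for an NER model (assumption, not the thesis model):
    confident on a few known function words, undecided on everything else."""
    known = {"the", "in", "was", "found"}
    return [
        {"O": 0.98, "ENTITY": 0.02} if tok.lower() in known
        else {"O": 0.5, "ENTITY": 0.5}
        for tok in sentence
    ]


pool = [
    ["The", "amphora", "was", "found", "in", "Pompeii"],
    ["The", "the", "in"],
    ["Neolithic", "flint", "scraper"],
]
# These are the sentences that would be sent to the paper's author for annotation.
queries = select_most_informative(pool, toy_predict, k=2)
```

In a full loop, the author's annotations for `queries` would be added to the training data, the model retrained, and the selection repeated on the remaining pool.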
Two empirical experiments are conducted: one is a real-world experiment, and the other is a simulation. The real-world experiment tackles the archaeological domain. In this experiment, an NER model is developed for two languages: English and Italian. The second experiment is in the biomedical domain, and an already annotated data-set is used to simulate the approach presented in this thesis. The results of the experiments suggest that the approach used in this thesis is a promising candidate for developing domain-specific NER, as it achieved results that are significantly higher than the baseline in terms of the F-score.
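For readers unfamiliar with the metric, the following sketch shows how an F-score is typically computed for NER. The abstract does not specify the variant or matching criterion, so this assumes the balanced F1 score with exact-match scoring of entity spans. The span tuples and the entity type names are made up for illustration and are not results from the thesis.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token, entity_type)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity scoring: a prediction counts only if both its
    boundaries and its type equal a gold annotation."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations only (hypothetical entity types).
gold = {(1, 2, "ARTEFACT"), (5, 6, "SITE"), (8, 10, "PERIOD")}
predicted = {(1, 2, "ARTEFACT"), (5, 6, "PERIOD")}  # second span has the wrong type

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Here one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), giving an F1 of 0.4.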
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: QA75 Electronic computers. Computer science