Use this URL to cite or link to this record in EThOS:
Title: Web relation extraction with distant supervision
Author: Augenstein, Isabelle
ISNI:       0000 0004 5991 5264
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2016
Availability of Full Text:
Access from EThOS:
Access from Institution:
Being able to find relevant information about prominent entities quickly is the main reason to use a search engine. However, with large quantities of information on the World Wide Web, real time search over billions of Web pages can waste resources and the end user’s time. One of the solutions to this is to store the answer to frequently asked general knowledge queries, such as the albums released by a musical artist, in a more accessible format, a knowledge base. Knowledge bases can be created and maintained automatically by using information extraction methods, particularly methods to extract relations between proper names (named entities). A group of approaches for this that has become popular in recent years are distantly supervised approaches as they allow to train relation extractors without text-bound annotation, using instead known relations from a knowledge base to heuristically align them with a large textual corpus from an appropriate domain. This thesis focuses on researching distant supervision for the Web domain. A new setting for creating training and testing data for distant supervision from the Web with entity-specific search queries is introduced and the resulting corpus is published. Methods to recognise noisy training examples as well as methods to combine extractions based on statistics derived from the background knowledge base are researched. Using co-reference resolution methods to extract relations from sentences which do not contain a direct mention of the subject of the relation is also investigated. One bottleneck for distant supervision for Web data is identified to be named entity recognition and classification (NERC), since relation extraction methods rely on it for identifying relation arguments. Typically, existing pre-trained tools are used, which fail in diverse genres with non-standard language, such as the Web genre. The thesis explores what can cause NERC methods to fail in diverse genres and quantifies different reasons for NERC failure. Finally, a novel method for NERC for relation extraction is proposed based on the idea of jointly training the named entity classifier and the relation extractor with imitation learning to reduce the reliance on external NERC tools. This thesis improves the state of the art in distant supervision for knowledge base population, and sheds light on and proposes solutions for issues arising for information extraction for not traditionally studied domains.
Supervisor: Fabio, Ciravegna ; Diana, Maynard Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available