Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.780612
Title: Machine learning approaches for event web data extraction
Author: Wiedmann, Julia
ISNI:       0000 0004 7966 2532
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Restricted access.
Abstract:
Web data is so ubiquitous and yet it remains so hard to get a quick answer to a simple question like "What's the closest pub with a pub quiz event going on tonight?" or "What meetups on data science are taking place this week in London?". Event search engines like Yahoo's Upcoming tried to address this challenge but failed at the long-tail problem, integrating with only a few large-scale ticketing and event announcement sites. Thus, automatically extracting upcoming events (or more precisely "event announcements") from websites including the long tail of local pubs, reading groups, or neighbourhood associations remains a crucial, but largely unsolved problem. Most general web data extraction approaches rely on the template hypothesis, which states that websites contain many pages that are generated from a background database using the same template. Thus, these approaches try to re-discover the structure of the template and often induce a wrapper (or "scraper") automatically that encodes the learned patterns about the template. Unfortunately, in the event (announcement) domain the template hypothesis does not hold: many websites only have a few active events at any moment in time and delete event announcements after the event has taken place. For such a small number of active events, often not even a consistent template is used, and where one is used, the number of samples is often small enough to challenge most existing web data extraction techniques. Therefore, in this thesis we focus on the problem of template-independent event data extraction, where not only is no per-site supervision necessary but also no restrictions on the template structure of the sites are imposed. Rather, the approach investigates each event page - that is, a page dedicated to the announcement of a particular event - in isolation. This also allows the approach to scale out massively.
More specifically, we introduce two complementary but quite different approaches to event extraction: TIDE leverages existing semantic annotations following the schema.org vocabulary to bootstrap a training corpus. Though such annotations are only available on a small subset of event pages, they are sufficiently common to serve as a rich and diverse training set. Intensive cleaning and filtering of those annotations is unfortunately necessary to eliminate the large amount of noisy, incorrect, or abused annotations present on the web. After this filtering step, the training data is used to first classify individual web elements as potential event attributes and then assemble multiple candidates into likely events, leveraging unique features interrelating the candidates for individual event attributes, such as alignment or position on the page. The latter step is necessary as event pages often contain additional event attributes from other events, shown as "related events" or other such ancillary information. Based on an extensive empirical evaluation, TIDE extracts such fields with 93% accuracy. Unfortunately, TIDE is limited by the availability of decent semantic annotations, and those are only available for a few coarse-grained event attributes, namely title, date, and location. Therefore, we introduce a second approach, TIME, that uses more sophisticated feature engineering to reduce the needed training data and thus can afford the manual creation of training data for a larger and more fine-grained set of attributes. We thus refer to it as multi-attribute event extraction. Like the first approach, it first identifies candidates for each attribute separately, but this time with features configured for the specific attribute, e.g., through a set of gazetteers for related labels or regular expressions for detecting instance patterns, such as a date.
Our empirical evaluation demonstrates that the unique characteristics of the events domain can be leveraged by encoding them into machine-readable features to achieve high accuracy without massive amounts of training data. We report extraction results in the range of 82-95% across a variety of attributes. This approach has been implemented at the commercial data extraction company Diffbot and is currently available as a beta. It is already being used to add event information to their Knowledge Graph and is planned to be released as a public API in the next few months.
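The per-attribute candidate detection described in the abstract (gazetteers for related labels, regular expressions for instance patterns such as dates) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the thesis's implementation: the names `LOCATION_LABELS`, `DATE_RE`, `Candidate`, and `find_candidates`, the tiny label list, and the single date format are all hypothetical, and the page is simplified to a list of text nodes.

```python
import re
from dataclasses import dataclass

# Hypothetical gazetteer of labels that often introduce a venue (assumption).
LOCATION_LABELS = {"venue", "location", "where", "address"}

# Illustrative date pattern covering forms like "12 March 2018" or "3 Apr 2018".
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
DATE_RE = re.compile(rf"\b\d{{1,2}}\s+{MONTHS}\s+\d{{4}}\b", re.IGNORECASE)


@dataclass
class Candidate:
    attribute: str  # e.g. "date" or "location"
    text: str       # the matched value
    position: int   # index of the text node on the page


def find_candidates(text_nodes):
    """Collect attribute candidates from each text node independently."""
    candidates = []
    for i, node in enumerate(text_nodes):
        # Instance pattern: a regular expression recognises date-like strings.
        match = DATE_RE.search(node)
        if match:
            candidates.append(Candidate("date", match.group(0), i))
        # Label gazetteer: "Venue: ..." marks the rest of the node as a location.
        label, sep, value = node.partition(":")
        if sep and label.strip().lower() in LOCATION_LABELS and value.strip():
            candidates.append(Candidate("location", value.strip(), i))
    return candidates


# Made-up page content for illustration only.
nodes = [
    "Pub Quiz Night",
    "Date: 12 March 2018",
    "Venue: The Red Lion, Oxford",
    "Related: Jazz on 3 Apr 2018",
]
found = find_candidates(nodes)
```

On this made-up input the sketch finds two date candidates, one of which belongs to a "related" event. This is the situation the abstract points to when it says a separate step is needed to assemble candidates into likely events, for example by using their position on the page.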
Supervisor: Gottlob, Georg ; Furche, Tim
Sponsor: University of Oxford
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.780612
DOI: Not available
Keywords: Computer science