Use this URL to cite or link to this record in EThOS:
Title: Data extraction & semantic annotation from web query result pages
Author: Anderson, Neil David Alan
ISNI:       0000 0004 6060 9547
Awarding Body: Queen's University Belfast
Current Institution: Queen's University Belfast
Date of Award: 2016
Availability of Full Text:
Access from EThOS:
Our unquenchable thirst for knowledge is one of the few things that really defines our humanity. Yet the Information Age, which we have created, has left us floating aimlessly in a vast ocean of unintelligible data. Hidden Web databases are one massive source of structured data. The contents of these databases are, however, often only accessible through a query proposed by a user. The data returned in these Query Result Pages is intended for human consumption and, as such, has nothing more than an implicit semantic structure which can be understood visually by a human reader, but not by a computer. This thesis presents an investigation into the processes of extraction and semantic understanding of data from Query Result Pages. The work is multi-faceted and includes at the outset, the development of a vision-based data extraction tool. This work is followed by the development of a number of algorithms which make use of machine learning-based techniques first to align the data extracted into semantically similar groups and then to assign a meaningful label to each group. Part of the work undertaken in fulfilment of this thesis has also addressed the lack of large, modern datasets containing a wide range of result pages representing of those typically found online today. In particular, a new innovative crowdsourced dataset is presented. Finally, the work concludes by examining techniques from the complementary research field of Information Extraction. An initial, critical assessment of how these mature techniques could be applied to this research area is provided.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available