Title: Automated domain-aware form understanding with OPAL, with a case study in the UK real-estate domain
Author: Guo, Xiaonan
ISNI: 0000 0004 2729 7123
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2012
Web forms are the interfaces to the deep web, and automated form understanding is the key to unlocking its contents. It is a fundamental problem in many applications and research fields, such as deep web crawling, data integration, and information extraction. It is also essential for improving web usability and accessibility. Form understanding is an inherently empirical problem. Existing form understanding approaches are restricted by their reliance on limited, domain-independent feature sets, which leads to overly generic and monolithic algorithms. In response, we present OPAL (Ontology based web Pattern Analysis with Logic), a domain-aware form understanding approach that addresses all these limitations through a novel multi-scope approach. OPAL achieves this through domain-independent form labeling and domain-dependent form interpretation. In form labeling, OPAL associates texts with fields as labels through three domain-independent scopes that exploit textual, structural, and visual information. In form interpretation, OPAL integrates the resulting form labeling with a layer of high-level domain knowledge to classify form fields and to repair the form model. To ease the task of designing domain schemata, we develop the template language OPAL-TL to express domain types and their structural constraints. With OPAL-TL, we describe common design patterns as templates maintained in a library. Thus, adaptation to a new domain often requires only instantiating the templates with the corresponding domain types. We conduct extensive experiments that cover both domain-independent cross-domain testing on standard form understanding benchmarks and a domain-aware evaluation on two datasets randomly selected from the real-estate and used-car domains. OPAL outperforms previous work by a significant margin and pushes the state of the art to near-perfect accuracy (> 98%).
In an effort to integrate OPAL with an entire data extraction pipeline, we plan to extend OPAL with form probing and to exploit information obtained by other data extraction components, e.g., result page analysis.
Supervisor: Gottlob, Georg
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Computing ; Software engineering ; Applications and algorithms