Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.565960
Title: Automated domain-aware form understanding with OPAL : with a case study in the UK real-estate domain
Author: Guo, Xiaonan
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2012
Availability of Full Text:
Full text unavailable from EThOS.
Please contact the current institution’s library for further details.
Abstract:
Web forms are the interfaces to the deep web, and automated form understanding is the key to unlocking its contents. It is a fundamental problem in many applications and research fields, such as deep web crawling, data integration, and information extraction. It is also essential for improving web usability and accessibility.
Form understanding is an inherently empirical problem. Existing form understanding approaches are restricted to limited, domain-independent feature sets, which leads to overly generic and monolithic algorithms. In response, we present OPAL (Ontology based web Pattern Analysis with Logic), a domain-aware form understanding approach that addresses all these limitations through a novel multi-scope approach. OPAL achieves this through domain-independent form labeling and domain-dependent form interpretation. In form labeling, OPAL associates texts with fields as labels through three domain-independent scopes that exploit textual, structural, and visual information. In form interpretation, OPAL integrates the labeling obtained with a layer of high-level domain knowledge to classify form fields and to repair the form model.
To ease the task of designing domain schemata, we develop the template language OPAL-TL to express domain types and their structural constraints. With OPAL-TL, we describe common design patterns as templates maintained in a library. Thus, adaptation to a new domain often requires only instantiating the templates with the corresponding domain types.
We conduct extensive experiments that cover both domain-independent, cross-domain testing with standard form understanding benchmarks and a domain-aware evaluation with two datasets randomly selected from the real-estate and used-car domains. OPAL outperforms previous work by a significant margin and pushes the state of the art to near-perfect accuracy (> 98%).
In an effort to integrate OPAL with an entire data extraction pipeline, we plan to extend OPAL with form probing and to exploit information obtained by other data extraction components, e.g., result page analysis.
Supervisor: Gottlob, Georg
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.565960
DOI: Not available
Keywords: Parallel processing (Electronic computers) ; Programming languages (Electronic computers) ; Invisible Web