Title: OXPath : a scalable, memory-efficient formalism for data extraction from modern web applications
Author: Sellers, Andrew
ISNI:       0000 0004 2721 6549
Awarding Body: Oxford University
Current Institution: University of Oxford
Date of Award: 2011
Availability of Full Text:
Full text unavailable from EThOS.
Please contact the current institution’s library for further details.
The evolution of the web has outpaced itself: the growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data for most web extraction tasks, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. This dissertation discusses OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and demonstrate that its evaluation is dominated by technical aspects such as the page rendering of the underlying browser. We also present OXPath host languages, including OXLatin. OXLatin extends the well-known Pig Latin language and can run on a standard Hadoop cluster. The OXLatin language facilitates distributed expression evaluation in a cloud computing paradigm, providing support for common web extraction scenarios that include expression composition, aggregation, and integration. OXLatin adds support for continuations within its programs, which increases its efficiency by eliminating unneeded page fetches. Our experiments confirm the scalability of OXPath and OXLatin. We further show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license. We also discuss applications and ongoing tool development that establish OXPath as a data extraction tool that advances the state of the art.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available