OXPath: a scalable, memory-efficient formalism for data extraction from modern web applications
The evolution of the web has outpaced itself: the growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data for most web extraction tasks, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies.

This dissertation discusses OXPath, an extension of XPath for interacting with web applications and for extracting the information thus revealed. It addresses all of the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that evaluation is dominated by technical aspects such as the page rendering of the underlying browser.

We also present OXPath host languages, including OXLatin. OXLatin extends the well-known Pig Latin language and can run on a standard Hadoop cluster. It facilitates distributed expression evaluation in a cloud computing paradigm, supporting common web extraction scenarios that include expression composition, aggregation, and integration. OXLatin adds support for continuations within its programs, which increases efficiency by eliminating unneeded page fetches.

Our experiments confirm the scalability of OXPath and OXLatin. We further show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license. We also discuss applications and ongoing tool development that establish OXPath as a data extraction tool that advances the state of the art.