Title: A teachable semi-automatic web information extraction system based on evolved regular expression patterns
Author: Siau, Nor Zainah
ISNI: 0000 0004 5354 5194
Awarding Body: Loughborough University
Current Institution: Loughborough University
Date of Award: 2014
This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. The system uses a human as a teacher to identify and extract relevant information from semi-structured HTML web pages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon when new tokens appear in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements.
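The idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a hypothetical base lexicon of regex fragments, treats teacher input as (page, target) pairs, extends the lexicon with literal tags and punctuation found in the training pages, and uses a simple mutation-only evolutionary search in place of full tree-based GP. All names (`BASE_LEXICON`, `extend_lexicon`, `fitness`, `evolve`) are illustrative assumptions.

```python
import random
import re

# Hypothetical base lexicon of regex fragments (an assumption for illustration).
BASE_LEXICON = [r"\d+", r"[A-Za-z]+", r"\s*", r"\.", r"[^<]+"]
MAX_TOKENS = 3  # cap on tokens per genome part, keeps patterns small


def extend_lexicon(lexicon, pages):
    """Add escaped literal tags and punctuation seen in training pages as new tokens."""
    tokens = list(lexicon)
    for page in pages:
        for literal in re.findall(r"</?\w+>|[^\w\s<>]", page):
            escaped = re.escape(literal)
            if escaped not in tokens:
                tokens.append(escaped)
    return tokens


def to_regex(genome):
    """A genome is (context tokens, captured tokens); the group marks the extracted field."""
    prefix, body = genome
    return "".join(prefix) + "(" + "".join(body) + ")"


def fitness(genome, examples):
    """Count teacher-labelled examples where the pattern captures exactly the target."""
    try:
        pattern = re.compile(to_regex(genome))
    except re.error:
        return 0
    score = 0
    for page, target in examples:
        match = pattern.search(page)
        if match and match.group(1) == target:
            score += 1
    return score


def mutate(genome, lexicon, rng):
    """Insert, delete, or replace one token in either part of the genome."""
    parts = [list(genome[0]), list(genome[1])]
    part = parts[rng.randrange(2)]
    op = rng.choice(["insert", "delete", "replace"])
    if not part or (op == "insert" and len(part) < MAX_TOKENS):
        part.insert(rng.randrange(len(part) + 1), rng.choice(lexicon))
    elif op == "delete":
        del part[rng.randrange(len(part))]
    else:
        part[rng.randrange(len(part))] = rng.choice(lexicon)
    return (tuple(parts[0]), tuple(parts[1]))


def evolve(examples, lexicon, generations=50, pop_size=20, seed=0):
    """Mutation-only search with elitism; returns the fittest genome found."""
    rng = random.Random(seed)
    population = [((rng.choice(lexicon),), (rng.choice(lexicon),))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, examples), reverse=True)
        elite = population[: pop_size // 2]
        children = [mutate(rng.choice(elite), lexicon, rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=lambda g: fitness(g, examples))
```

In this sketch the teacher's role is reduced to supplying the labelled pairs, for example `("<b>Price:</b> $42.50", "42.50")`. Calling `extend_lexicon` on the training pages before `evolve` is what lets the search use page-specific tokens such as `</b>` or `$` that the base lexicon lacks.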
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: TS-WIE ; Dynamic grammar definition ; Genetic programming ; Regular expressions pattern and structural pattern (DOM).