Use this URL to cite or link to this record in EThOS:
Title: The Corpus Expansion Toolkit : finding what we want on the web
Author: Pay, Jack Frederick
ISNI:       0000 0004 9353 3251
Awarding Body: University of Sussex
Current Institution: University of Sussex
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis and the development of the CET is to provide a solution to discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collected, store, filter and perform information extraction on collected documents. Using a small number of example ‘seed' documents the CET is able to expand the original corpus by finding more relevant documents from the web and provide a number of tools to support their analysis. This thesis presents a case study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies covering a wide variety of domains and requirements that the CET has been applied. These case studies hope to illustrate three main use cases, listed below, where the CET is applicable: 1. Domain known – source known 2. Domain known – source unknown 3. Domain unknown – source unknown First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but where to find relevant documents on the web is unknown. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic and iteratively build a clear conception of a domain and translate that to a computational system.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: QA0075 Electronic computers. Computer science ; TK7885 Computer engineering. Computer hardware ; ZA4150 Computer network resources