Use this URL to cite or link to this record in EThOS:
Title: Characterising semantically coherent classes of text through feature discovery
Author: Robertson, Andrew David
Awarding Body: University of Sussex
Current Institution: University of Sussex
Date of Award: 2019
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
There is a growing need to provide support for social scientists and humanities scholars to gather and "engage" with very large datasets of free text, to perform very bespoke analyses. method52 is a text analysis platform built for this purpose (Wibberley et al., 2014), and forms a foundation that this thesis builds upon. A central part of method52 and its methodologies is a classifier training component based on dualist (Settles, 2011), and the general process of data engagement with method52 is determined to constitute a continuous cycle of characterising semantically coherent sub-collections, classes, of the text. Two broad methodologies exist for supporting this type of engagement process: (1) a top-down approach wherein concepts and their relationships are explicitly modelled for reasoning, and (2) a more surface-level, bottom-up approach, which entails the use of key terms (surface features) to characterise data. Following the second of these approaches, this thesis examines ways of better supporting this type of data engagement to more effectively support the needs of social scientists and humanities scholars in engaging with text data. The classifier component provides an active learning training environment emphasising the labelling of individual features. However, it can be difficult to interpret and incorporate prior knowledge of features. The process of feature discovery based on the current classifier model does not always produce useful results. And understanding the data well enough to produce successful classifiers is timeconsuming. A new method for discovering features in a corpus is introduced, and feature discovery methods are explored to resolve these issues. When collecting social media data, documents are often obtained by querying an API with a set of key phrases. Therefore, the set of possible classes characterising the data is defined by these basic surface features. It is difficult to know exactly which terms must searched for, and the usefulness of terms can change over time as new discussions and vocabulary emerge. Building on the feature discovery techniques, a framework is presented in this thesis for streaming data with an automatically adapting query to deal with these issues.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: AZ0195 Electronic information resources. Including computer network resources, the Internet, digital libraries, etc. ; Q0387.5 Semantic networks ; QA076.9.D343 Data mining ; Z0696 Classification and notation