Title: Unsupervised detection of anomalous text
Author: Guthrie, David
ISNI: 0000 0004 2672 0436
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2008
This thesis describes work on the detection of anomalous material in text without the use of training data. We use the term anomalous to refer to text that is irregular, or deviates significantly from its surrounding context. In this thesis we show that identifying such abnormalities in text can be viewed as a type of outlier detection, because these anomalies will differ significantly from the writing style in the majority of the document. We consider segments of text that are anomalous with respect to topic (about a different subject), author (written by a different person), or genre (written for a different audience or from a different source) and experiment with whether it is possible to identify these anomalous segments automatically. Five different innovative approaches to this problem are introduced and assessed using many experiments over large document collections, created to contain randomly inserted anomalous segments. In order to identify anomalies in text successfully, we investigate and evaluate 166 stylistic and linguistic features used to characterize writing, some of which are well-established stylistic determiners, but many of which are original. Using these features with each of our methods, we examine the effect of segment size on our ability to detect anomalies, allowing segments of size 100 words, 500 words and 1000 words. We show substantial improvements over a baseline in all cases for all methods, and we identify a novel method that performs consistently better than the others, as well as the features that contribute most to unsupervised anomaly detection.
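The general idea of treating anomalous segments as outliers can be illustrated with a minimal sketch. It is not one of the five approaches from the thesis, whose details are not given in this record. It assumes fixed-size word segments, three stand-in features (average word length, type-token ratio, fraction of tokens containing digits) in place of the 166 features mentioned above, and a simple standardized distance from the mean feature vector as the anomaly score. The function names `segment_words`, `features` and `rank_segments` are hypothetical.

```python
import math
import re


def segment_words(text, size):
    """Split text into consecutive segments of `size` words (the last may be shorter)."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def features(words):
    """Three illustrative stylistic features (assumed, not the thesis's feature set)."""
    tokens = [re.sub(r"\W", "", w).lower() for w in words]
    tokens = [t for t in tokens if t]
    if not tokens:
        return [0.0, 0.0, 0.0]
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    type_token = len(set(tokens)) / len(tokens)
    digit_frac = sum(any(c.isdigit() for c in t) for t in tokens) / len(tokens)
    return [avg_len, type_token, digit_frac]


def rank_segments(text, size=100):
    """Return (segment indices ordered most to least anomalous, per-segment scores)."""
    vecs = [features(s) for s in segment_words(text, size)]
    if not vecs:
        return [], []
    n, dims = len(vecs), len(vecs[0])
    means = [sum(v[d] for v in vecs) / n for d in range(dims)]
    # Population standard deviation per feature; fall back to 1.0 when it is zero.
    stds = [
        math.sqrt(sum((v[d] - means[d]) ** 2 for v in vecs) / n) or 1.0
        for d in range(dims)
    ]
    # Score: Euclidean norm of the standardized deviation from the mean vector.
    scores = [
        math.sqrt(sum(((v[d] - means[d]) / stds[d]) ** 2 for d in range(dims)))
        for v in vecs
    ]
    ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranking, scores
```

Calling `rank_segments(document_text, size=100)` returns segment indices ordered from most to least unusual under these assumed features; the 500-word and 1000-word settings mentioned above would correspond to changing `size`. No training data is involved, since each segment is compared only against the other segments of the same document.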
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available