Title: Data mining in text streams using suffix trees
Author: Snowsill, Tristan
ISNI: 0000 0004 2725 0747
Awarding Body: University of Bristol
Current Institution: University of Bristol
Date of Award: 2012
Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible, from single users to multinational corporations and governments. In this thesis we present a data structure based on a generalised suffix tree which is capable of solving a number of text stream mining tasks. It can be used to detect changes in the text stream, detect when chunks of text are reused and detect events through identifying when the frequencies of phrases change in a statistically significant way. Suffix trees have been used for many years in the areas of combinatorial pattern matching and computational genomics. In this thesis we demonstrate how the suffix tree can become more widely applicable by making it possible to use suffix trees to analyse streams of data rather than static data sets, opening up a number of future avenues for research. The algorithms which we present are designed to be efficient in an on-line setting by having time complexity independent of the total amount of text seen and polynomial in the rate at which text is seen. We demonstrate the effectiveness of our methods on a large text stream comprising thousands of documents every day. This text stream is the stream of text news coming from over 600 online news outlets and the results obtained are of interest to news consumers, journalists and social scientists.
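Illustrative note (not part of the thesis record): the abstract's idea of detecting events when phrase frequencies change in a statistically significant way can be sketched in a few lines. The sketch below is an assumption-laden stand-in, not the thesis's method: it uses plain word n-gram counting in place of a generalised suffix tree, and a two-proportion z statistic as the significance measure. The function names, the n-gram length and the threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def phrase_counts(docs, n=2):
    """Count word n-grams across a list of documents (naive stand-in for a suffix tree)."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def z_score(k_old, n_old, k_new, n_new):
    """Two-proportion z statistic comparing a phrase's share in two windows."""
    pooled = (k_old + k_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return 0.0
    return (k_new / n_new - k_old / n_old) / se


def rising_phrases(old_docs, new_docs, n=2, threshold=3.0):
    """Return (phrase, z) pairs whose frequency rose by at least `threshold` (assumed cut-off)."""
    old = phrase_counts(old_docs, n)
    new = phrase_counts(new_docs, n)
    n_old, n_new = sum(old.values()), sum(new.values())
    if n_old == 0 or n_new == 0:
        return []
    scored = [(ph, z_score(old[ph], n_old, c, n_new)) for ph, c in new.items()]
    return sorted([(ph, z) for ph, z in scored if z >= threshold],
                  key=lambda item: -item[1])


if __name__ == "__main__":
    # Toy windows: a reference period and a newer period of the stream.
    earlier = ["the market was quiet today"] * 20
    later = ["volcanic ash closed the airport"] * 20
    for phrase, z in rising_phrases(earlier, later):
        print(" ".join(phrase), round(z, 2))
```

Unlike the algorithms described in the abstract, this sketch recounts both windows from scratch, so its cost grows with the amount of text held in each window; it is meant only to make the frequency-change test concrete.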
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available