Title: Adaptive estimation of categorical data streams with applications in change detection and density estimation
Author: Plasse, Joshua
ISNI: 0000 0004 8499 6561
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2019
The need for efficient tools is pressing in the era of big data, particularly in applications that generate data streams: unbounded sequences of observations that arrive at high frequency and are subject to unknown changes in their data-generating process. These temporal changes are colloquially referred to as drift, and they require methodology that dynamically reconfigures parameters whenever changes occur. This is achieved by incorporating forgetting factors into the estimation process. The contributions of this work develop methodology for two under-researched areas of streaming inference, namely multiple changepoint detection for categorical data streams and streaming density estimation using histograms. Currently there is a dearth of literature devoted to change detection in categorical data streams, and the existing work typically introduces fixed parameters without providing insight into how to specify them. This is ill-suited to the streaming paradigm, motivating the need for approaches that introduce few parameters, which may be set without requiring prior knowledge of the stream. The first novel contribution of this thesis is a family of multinomial change detection methods (MCDMs), which assumes that the observations are independent. These detectors adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors. A novel adaptive thresholding technique is also developed, which can be computed given a desired false positive rate. The observations are then assumed to satisfy a first-order Markov property, and an adaptive detection and estimation procedure for transition matrices (ADEPT-M) is developed. This detector is based on a moment matching technique and effectively monitors for multiple changepoints in a transition matrix without making any assumptions about the number of changepoints or their magnitudes.
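As a rough illustration of how a forgetting factor can make category-probability estimates temporally adaptive, the sketch below uses a generic exponentially weighted recursion with a fixed factor. It is not the thesis's MCDM procedure: the function name `forgetting_factor_probs`, the parameter `lam`, and the choice of a fixed (rather than adaptively tuned) forgetting factor are all assumptions made for illustration.

```python
from typing import List, Sequence


def forgetting_factor_probs(
    stream: Sequence[int], num_categories: int, lam: float = 0.95
) -> List[float]:
    """Estimate category probabilities with exponential forgetting.

    Hypothetical sketch: `lam` is a fixed forgetting factor in [0, 1].
    lam = 1 recovers the ordinary relative frequencies; smaller values
    down-weight older observations more aggressively.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    weighted_counts = [0.0] * num_categories  # discounted count per category
    effective_n = 0.0                         # discounted sample size

    for x in stream:
        # Discount everything seen so far, then add the new observation.
        weighted_counts = [lam * c for c in weighted_counts]
        weighted_counts[x] += 1.0
        effective_n = lam * effective_n + 1.0

    if effective_n == 0.0:
        # No data yet: fall back to a uniform estimate (an assumption).
        return [1.0 / num_categories] * num_categories
    return [c / effective_n for c in weighted_counts]


# Usage: the stream switches from category 0 to category 1 halfway through.
drifting = [0] * 200 + [1] * 200
adaptive = forgetting_factor_probs(drifting, num_categories=2, lam=0.9)
static = forgetting_factor_probs(drifting, num_categories=2, lam=1.0)
```

With `lam=0.9` the estimate tracks the post-change regime almost entirely, whereas `lam=1.0` averages over both regimes and reports equal probabilities.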
The performance of the MCDMs and ADEPT-M is investigated via large simulation and real-application studies, which verify the usefulness of our approaches. The final contribution of this thesis is temporally adaptive streaming histograms (TASHs), which are suitable for univariate, continuous-valued data streams. Existing methods typically construct histograms with equal-width bins, which implicitly assumes knowledge of the range of the random variables or of how the data stream evolves over time. This is an unrealistic assumption, so in this thesis the bin widths are allowed to change over time and are computed in a data-driven manner. Novel methodology is also developed for merging and splitting existing bins, allowing for the accurate estimation of a data stream's drifting distribution. Several other quantities that are useful in streaming inference can be adaptively estimated directly from the TASHs with little increase in computation. These include techniques for streaming quantile estimation, non-parametric change detection, data stream comparison, ROC curves and outlier detection.
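To make concrete how a quantile might be read off a histogram whose bins have unequal widths, the following is a minimal sketch using linear interpolation within the bin that contains the target mass. It is a generic textbook approach, not the TASH methodology itself; the function name `histogram_quantile` and its arguments are hypothetical.

```python
from typing import Sequence


def histogram_quantile(
    edges: Sequence[float], counts: Sequence[float], q: float
) -> float:
    """Approximate the q-th quantile from a histogram.

    Hypothetical sketch: `edges` holds len(counts) + 1 increasing bin
    boundaries (bins may have different widths), and `counts` holds the
    possibly weighted mass in each bin. Mass is assumed to be spread
    uniformly within a bin.
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("edges must have exactly one more entry than counts")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")

    target = q * total   # mass that must lie to the left of the quantile
    cumulative = 0.0
    for left, right, c in zip(edges[:-1], edges[1:], counts):
        if c > 0 and cumulative + c >= target:
            # Interpolate linearly inside the bin that reaches the target.
            return left + (target - cumulative) / c * (right - left)
        cumulative += c
    return edges[-1]


# Usage: three bins of widths 1, 2 and 3 holding masses 2, 4 and 2.
median = histogram_quantile([0.0, 1.0, 3.0, 6.0], [2, 4, 2], 0.5)
```

Because the counts may be non-integer weights, the same function could in principle be applied to counts that have been discounted by a forgetting factor.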
Supervisor: Adams, Niall
Sponsor: Imperial College London
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral