Title: Statistical methods for monitoring multiple data streams

This thesis develops new methods to monitor multiple data streams and report some quantity of interest over time. We consider two types of setting.

First, we consider a data stream as realisations of a sequence of independent random variables that are revealed over time. To monitor the individual streams, we propose a new type of control chart based on the cumulative sum (CUSUM) chart. CUSUM charts are typically used to detect a change in the distribution of a sequence of observations, e.g. shifts in the mean. Usually, after signalling, the chart is restarted by setting it to some value below the signalling threshold. We propose a non-restarting CUSUM chart which is able to detect periods during which the stream is out of control. Further, we advocate an upper boundary to prevent the CUSUM chart from rising too high, which helps to detect a change back into control. We prove that the non-restarting charts are optimal, in a well-defined sense. We also investigate the performance of these charts as the upper boundary is varied; simulation results show a trade-off between the height of the upper boundary of the chart and the false signal rate.

We then present an algorithm to control the false discovery rate across multiple data streams using the non-restarting charts. We consider two definitions of a false discovery: signalling out-of-control when the observations have been in control since the start, and signalling out-of-control when the observations have been in control since the last time the chart was at zero. We prove that the false discovery rate is controlled under both definitions simultaneously. Simulations reveal the differences in false discovery rate control when using these and other desirable definitions of a false discovery.

In the second setting, a data stream is considered as observations of a Bayesian model revealed over time.
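As a concrete illustration of the first setting, the non-restarting CUSUM with an upper boundary might be sketched as follows. This is a minimal sketch: the reference value k, threshold h and boundary value are illustrative assumptions, not values from the thesis.

```python
def cusum_path(xs, k=0.5, h=4.0, upper=8.0):
    """Non-restarting CUSUM chart, clamped at 0 and at an upper boundary.

    k: reference value subtracted at each step
    h: signalling threshold
    upper: boundary preventing the chart from rising too high
    """
    s = 0.0
    path, signals = [], []
    for x in xs:
        # standard CUSUM recursion, additionally clamped at the upper boundary
        s = min(max(0.0, s + x - k), upper)
        path.append(s)
        # the chart is NOT reset after a signal; a run of True values
        # marks a period during which the stream appears out of control
        signals.append(s > h)
    return path, signals
```

Because the chart cannot climb above the boundary, it descends below the threshold soon after the stream returns to control, so the end of the signalling run is detected promptly.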
The aim is to report a posterior summary of interest quickly and within a user-specified degree of accuracy. A system is presented to tackle such problems. The estimates are calculated using weighted samples stored in a database. The stored samples are maintained so that the accuracy of the estimates and the quality of the samples remain satisfactory; this maintenance involves varying the number of samples in the database and updating their weights. New samples are generated, when required, by a Markov chain Monte Carlo algorithm. The system is demonstrated using a football league model to predict the end-of-season table. The correctness of the estimates and their accuracy are shown in a simulation using a linear Gaussian model. Lastly, potential improvements to the system are investigated: a series of motivating simulations illustrates some potential problems of the system, and remedial solutions are suggested, with a view toward implementation in the near future.
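Estimation from stored weighted samples can be illustrated with a minimal self-normalised sketch. The effective sample size (ESS) used below is a standard weight-degeneracy diagnostic, named here as a point of reference; it is not necessarily the maintenance criterion the system itself applies.

```python
def weighted_estimate(samples, weights):
    """Self-normalised weighted estimate of a posterior expectation,
    together with the effective sample size (ESS) of the weights.

    A low ESS suggests the stored samples need refreshing, e.g. by
    generating new ones with an MCMC algorithm.
    """
    total = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, samples)) / total
    ess = total * total / sum(w * w for w in weights)
    return estimate, ess
```

With equal weights the ESS equals the number of stored samples; the more uneven the weights become, the smaller the ESS, signalling that the database should be maintained.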
