Title:

Methods for the investigation of spatial clustering, with epidemiological applications

When analysing spatial data, it is often of interest to investigate whether the events under consideration show any tendency to form small aggregations, or clusters, that are unlikely to be the result of random variation. For example, the events might be the coordinates of the address at diagnosis of cases of a malignant disease, such as acute lymphoblastic leukaemia or non-Hodgkin's lymphoma. This thesis considers the usefulness of methods employing nonparametric kernel density estimation for the detection of clustering, as defined above, so that specific, and sometimes limiting, alternative hypotheses are not required, and the continuous spatial context of the problem is maintained. Two approaches in particular are considered: first, a generalisation of the Scan Statistic to two dimensions, with a correction for spatial heterogeneity under the null hypothesis; and secondly, a statistic measuring the squared difference between kernel estimates of the probability density functions of the principal events and a sample of controls. Chapter 1 establishes the background for this work and identifies four different families of techniques that have previously been proposed for the study of clustering. Problems inherent in typical applications are discussed and then used to motivate the approach taken subsequently. Chapter 2 describes the Scan Statistic for a one-dimensional problem, assuming that the distribution of events under the null hypothesis is uniform. A number of approximations to the statistic's distribution and methods of calculating critical values are compared, to enable significance testing to be carried out with minimum effort. A statistic based on the supremum of a kernel density estimate is also suggested, but an empirical study demonstrates that this has lower power than the Scan Statistic.
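As an illustration of the one-dimensional Scan Statistic under a uniform null distribution, the following minimal sketch computes the largest number of events falling in any window of fixed width on the unit interval. The function names, the window width and the use of Monte Carlo simulation to obtain a p-value are assumptions made for illustration only; the thesis itself compares analytical approximations and methods for critical values, which are not reproduced here.

```python
import numpy as np


def scan_statistic(points, width):
    """Largest number of points in any closed interval of length `width`."""
    x = np.sort(np.asarray(points, dtype=float))
    # A maximising window can always be slid so that it starts at an event,
    # so it suffices to count the points in [x[i], x[i] + width] for each i.
    right = np.searchsorted(x, x + width, side="right")
    return int(np.max(right - np.arange(x.size)))


def monte_carlo_p_value(points, width, n_sim=999, seed=0):
    """Monte Carlo p-value under a uniform null on [0, 1] (illustrative only)."""
    rng = np.random.default_rng(seed)
    observed = scan_statistic(points, width)
    n = len(points)
    exceed = 0
    for _ in range(n_sim):
        # Simulate the same number of events from the uniform null.
        exceed += scan_statistic(rng.uniform(size=n), width) >= observed
    return observed, (1 + exceed) / (n_sim + 1)
```

For example, `monte_carlo_p_value(events, width=0.05)` returns the observed statistic together with an estimated upper-tail probability, assuming the event locations have been rescaled to lie in [0, 1].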
Chapter 3 generalises the Scan Statistic to two dimensions and demonstrates empirically that existing bounds for the upper tail probability are not sufficiently sharp for significance testing purposes. As an aside, the chapter also describes a problem that can occur when a single pseudorandom number generator is used to produce parallel streams of uniform deviates. Chapter 4 investigates a method, suggested by Weinstock (1981), of correcting for a known, non-uniform null distribution when using the Scan Statistic in one dimension, and proposes that a kernel estimator replace the exact density, the estimate being calculated from a second set of (control) observations. The approach is generalised to two dimensions, and approximations are developed to simplify the computation required. However, simulation results indicate that the accuracy of these approximations is often poor, so an alternative implementation is suggested. For the case where two samples of observations are available, the events of interest and a group of control locations, Chapter 5 suggests the use of the integrated squared difference between the corresponding kernel density estimates as a measure of the departure of the events from null expectation. By exploiting its similarity to the integrated square error of a kernel density estimate (k.d.e.), the statistic is shown to be asymptotically normal; the proof generalises a central limit theorem of Hall (1984) to the two-sample case. However, simulation results suggest that significance testing should use the bootstrap, since the exact distribution of the statistic appears to be noticeably skewed. A modified statistic, with the smoothing parameters of the two k.d.e.s constrained to be equal and non-random, is also discussed, and shown, both asymptotically and empirically, to have greater power than the original.
In Chapter 6, the two techniques are applied to the geographical distribution of cases of laryngeal cancer in South Lancashire for the period 1974 to 1983. The results are similar, for the most part, to a previous analysis of the data, described by Diggle (1990) and Diggle et al. (1990). The differences between the two analyses appear to be attributable to the bias or variability of the k.d.e.s required to calculate the integrated squared difference statistic, and to the inaccuracy of the approximations used by the corrected Scan Statistic. Chapter 7 summarises the results obtained in the preceding chapters, and considers the implications for further research of the observations made in Chapter 6 regarding the weaknesses of the two statistics. It also suggests extensions to the basic methodology presented here that would increase the range of problems to which the two methods could be applied.
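The integrated squared difference statistic with a common, fixed smoothing parameter can be sketched as follows. The sketch assumes Gaussian kernels with a single bandwidth `h`, for which the integral has a closed form, because the product of two Gaussian kernels integrates to a Gaussian density with variance 2h² evaluated at the difference of the two centres. The function names are hypothetical, and the random relabelling of cases and controls used to judge significance is a simple stand-in for, not a reproduction of, the bootstrap procedure recommended in the thesis.

```python
import numpy as np


def _cross_term(a, b, h):
    """Integral of the product of two Gaussian k.d.e.s with bandwidth h.

    Equals the mean over all pairs (i, j) of the N(0, 2 h^2 I) density
    evaluated at a[i] - b[j].
    """
    d = a.shape[1]
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    norm = (4.0 * np.pi * h**2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (4.0 * h**2))) / norm


def integrated_squared_difference(cases, controls, h):
    """Integral of (f_hat - g_hat)^2 for arrays of shape (n, d) and (m, d)."""
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    return (_cross_term(cases, cases, h)
            - 2.0 * _cross_term(cases, controls, h)
            + _cross_term(controls, controls, h))


def relabelling_p_value(cases, controls, h, n_sim=199, seed=0):
    """P-value from random relabelling of the pooled sample (an assumption)."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    pooled = np.vstack([cases, controls])
    n = len(cases)
    observed = integrated_squared_difference(cases, controls, h)
    exceed = 0
    for _ in range(n_sim):
        # Reassign case/control labels at random, keeping the sample sizes.
        perm = rng.permutation(len(pooled))
        sim = integrated_squared_difference(pooled[perm[:n]], pooled[perm[n:]], h)
        exceed += sim >= observed
    return observed, (1 + exceed) / (n_sim + 1)
```

The closed form avoids numerical integration over the study region, at the cost of evaluating all pairwise distances; for large samples a gridded approximation might be preferable.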
