Title:

Statistical methods for comparing labelled graphs

Owing to the vast amount of graph-structured data generated in various experimental settings (e.g., biological processes, social connections), the need to rapidly identify structural differences between networks is becoming increasingly pressing. In many fields, such as bioinformatics, social network analysis and neuroscience, graphs estimated from the same experimental settings are always defined on a fixed set of objects. We formalise this as a labelled graph comparison problem. The main issue in this area, i.e. measuring the distance between graphs, has been extensively studied over the past few decades. Although a large distance value constitutes evidence of a difference between graphs, we are more interested in inferentially justifying whether a distance as large as or larger than the observed one could have been obtained simply by chance. However, little work has been done to provide the statistical inference procedures needed to answer this question formally. Permutation-based inference has been proposed as a theoretically sound and natural way of tackling such a problem. However, the common permutation procedure is computationally expensive, especially for large graphs. This thesis contributes to the labelled graph comparison problem by addressing three different topics. Firstly, we analyse two labelled graphs by testing whether they are independent. A permutation-based testing procedure based on the Generalized Hamming Distance (GHD) is proposed. We show rigorously that the permutation distribution is approximately normal for a large network, under three graph models with two different types of edge weights. Statistical significance can therefore be evaluated without resorting to computationally expensive permutation procedures. Numerical results support the validity of this approximation.
We also suggest that, with the Topological Overlap edge weight, the GHD test is more powerful for identifying network differences. Secondly, we tackle the problem of comparing two large complex networks in which only localised topological differences are assumed. By applying the normal approximation for the GHD test, we propose an algorithm that can effectively detect localised changes in network structure between two large complex networks. This algorithm is quick and easy to implement. Simulations and applications suggest that it is a useful tool for detecting subtle differences in complex network structures. Finally, we address the problem of comparing multiple graphs. For this topic, we analyse two different problems that can be interpreted as corresponding to two distinct null hypotheses: (i) the graphs in a set are mutually independent; (ii) the graphs in one set are independent of the graphs in another set. Applications of the multiple graphs problem are commonly found in social network analysis for (i) and in neuroscience for (ii). However, little work has been done to address the comparison of multiple networks inferentially. We propose two different statistical testing procedures for (i) and (ii), again using a normal approximation for GHD. We extend the asymptotic normality of GHD from the two-graph case to the multiple-graph case, for hypotheses (i) and (ii), with two different permutation strategies. We further build a link between the test of group independence and an existing method, namely the Multivariate Exponential Random Graph Permutation model (MERGP). We show that, by applying asymptotic normality, the maximum likelihood estimate of MERGP can be derived analytically. Therefore, the original, computationally expensive inferential procedure of MERGP is no longer needed.
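As a rough illustration of the two-graph permutation test summarised above, the following Python sketch permutes the node labels of one graph and standardises the observed distance against the resulting permutation distribution. It rests on assumptions not stated in this abstract: the GHD is taken to be the mean squared difference of off-diagonal edge weights, the function names `ghd` and `ghd_permutation_summary` are hypothetical, and the permutation mean and standard deviation are estimated by Monte Carlo rather than by the analytical normal approximation developed in the thesis.

```python
import numpy as np


def ghd(a, b):
    """Mean squared difference of off-diagonal edge weights.

    Assumed form of the GHD for this sketch; the thesis gives the
    formal definition.
    """
    n = a.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    return float(np.sum((a[off_diag] - b[off_diag]) ** 2) / (n * (n - 1)))


def ghd_permutation_summary(a, b, n_perm=999, seed=0):
    """Standardise the observed GHD against a node-label permutation null.

    Returns (observed, null_mean, null_std, z_score). The permutation
    moments are Monte Carlo estimates, not analytical values.
    """
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    observed = ghd(a, b)
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(n)
        # Relabel the nodes of b: permute rows and columns together.
        null[k] = ghd(a, b[np.ix_(perm, perm)])
    null_mean = float(null.mean())
    null_std = float(null.std(ddof=1))
    z_score = (observed - null_mean) / null_std
    return observed, null_mean, null_std, z_score
```

Under this sketch, a strongly negative standardised score indicates that the two graphs are more similar than randomly relabelled graphs would be. The loop is the expensive step for large graphs, and it is this step that an analytical approximation to the permutation distribution would replace.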
