Title:

Cluster detection by lifting with application to phylogenetics

In this thesis, we propose a new algorithm that automatically detects the number of clusters in a tree-structured data set by denoising generalized node values in the tree using the lifting “one coefficient at a time” (LOCAAT) algorithm introduced by Jansen et al. (2001). Our algorithm can be applied to any multidimensional data set using the compactness value as the node value, or to phylogenetic data sets (DNA sequences) using either the compactness value or the dissimilarity score as the node value. The compactness value is defined as the average distance from the centroid of each possible cluster in the tree, and the dissimilarity score is the average number of loci at which at least one of the sequences under the node of interest does not share the same nucleotide as the others. For multidimensional data sets, we consider each node in the tree as a possible location of a cluster after denoising the tree by LOCAAT. Thus, for each possible cluster, we check how much departure from the centroid of the cluster we can allow while still assigning the objects under the node of interest to one cluster. If a node and all its child nodes are denoised to a value less than or equal to the allowed amount of departure from the centroids of their clusters, a cluster is located at this node. We also propose another version of our algorithm, based on nondecimated lifting (Knight & Nason, 2009), in which we generate a probability of being clustered for each node. If a node and all its child nodes have a probability of being clustered less than or equal to the probability of acceptance, θ ∈ [0, 1], a cluster is located at this node. We provide a comparison study between our algorithms and some internal cluster validity indices (CVIs) available in the literature, using artificial data sets and a real data set. In addition, we compare the performance of each method using some available external cluster validity scores.
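The compactness value defined above can be illustrated with a short sketch. It assumes Euclidean distance and represents the objects under a node as coordinate tuples; the function names `centroid` and `compactness` are hypothetical and are not taken from the thesis.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


def compactness(points):
    """Average Euclidean distance from each point to the centroid.

    Assumption: Euclidean distance; the thesis may use another metric.
    """
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Two candidate clusters with the same shape but different spread.
tight = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]

print(compactness(tight) < compactness(loose))  # the tighter group scores lower
```

In this sketch a smaller value indicates that the objects under a node lie closer to their centroid, which is the quantity that is then denoised across the tree.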
For phylogenetic data sets, we check the performance of our algorithms and other CVIs using both the compactness value and the dissimilarity score as the node value. To compute the compactness value for a phylogenetic tree, we first find the position of each species in R^p using multidimensional scaling (MDS); our algorithm can then identify which species share similar features. If we use the dissimilarity score as the node value, we cluster similar species together by finding how much difference we can allow between species. We check the performance of our algorithms using artificial data sets and a real data set. In the final part of the thesis, we propose a visualization tool for cophylogenetic data sets. We consider only the case of two associated phylogenetic trees, and we apply our algorithm to the host and parasite trees separately to provide a summary of these data sets. We check the performance of our algorithm using two well-known cophylogenetic data sets.
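As a rough illustration of a dissimilarity score for the sequences under one node, the sketch below uses one possible reading of the definition: the proportion of aligned loci at which the sequences do not all carry the same nucleotide. This reading, the normalization by sequence length, and the name `dissimilarity_score` are assumptions and may differ from the definition used in the thesis.

```python
def dissimilarity_score(sequences):
    """Proportion of loci at which the aligned sequences are not all identical.

    Assumption: sequences are pre-aligned strings of equal length.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) walks the alignment column by column (locus by locus).
    variable_loci = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    return variable_loci / length


# Toy sequences under a hypothetical node: only the last locus differs.
node_sequences = ["ACGT", "ACGA", "ACGT"]
print(dissimilarity_score(node_sequences))
```

Under this reading, a node whose sequences are identical scores 0, and the score grows as more loci vary among the species below the node.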
