Learning and approximation algorithms for problems motivated by evolutionary trees
In this thesis we consider some computational problems motivated by the biological problem of reconstructing evolutionary trees. We are concerned with the design and analysis of efficient algorithms for clearly defined combinatorial problems motivated by this application area. We present results for two different kinds of problem. Our first problem is motivated by models of evolution that describe the evolution of biological species in terms of a stochastic process that alters the DNA of species. The particular stochastic model that we consider is called the Two-State General Markov Model. In this model, an evolutionary tree can be associated with a distribution on the different "patterns" that may appear among the sequences for all the species in the evolutionary tree. The data for a collection of species whose evolutionary tree is unknown can then be viewed as samples from this (unknown) distribution. An interesting problem asks whether we can use samples from an unknown evolutionary tree M to find another tree M* for those species, so that the distribution of M* is similar to that of M. This is essentially a PAC-learning ("Probably Approximately Correct") problem in the sense of Valiant and Kearns et al. Our results show that evolutionary trees in the Two-State General Markov Model can be efficiently PAC-learned in the variation distance metric using a "reasonable" number of samples. The two other problems that we consider are combinatorial problems that are also motivated by evolutionary tree construction. The input to each of these problems consists of a fixed tree topology whose leaves are bijectively labelled by the elements of a species set, as well as data for those species. Both problems involve labelling the internal nodes of the fixed topology in order to minimize some function on that tree (both functions that we consider are assumed to test the quality of the tree topology in some way).
Both of these combinatorial problems are known to be NP-hard. Our contribution is to present efficient approximation algorithms for both of them.