Use this URL to cite or link to this record in EThOS:
Title: Multiple sequence analysis in the presence of alignment uncertainty
Author: Herman, Joseph L.
ISNI:       0000 0004 5355 3119
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2014
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Restricted access.
Access from Institution:
Sequence alignment is one of the most intensely studied problems in bioinformatics, and is an important step in a wide range of analyses. An issue that has gained much attention in recent years is the fact that downstream analyses are often highly sensitive to the specific choice of alignment. One way to address this is to jointly sample alignments along with other parameters of interest. In order to extend the range of applicability of this approach, the first chapter of this thesis introduces a probabilistic evolutionary model for protein structures on a phylogenetic tree; since protein structures typically diverge much more slowly than sequences, this allows for more reliable detection of remote homologies, improving the accuracy of the resulting alignments and trees, and reducing sensitivity of the results to the choice of dataset. In order to carry out inference under such a model, a number of new Markov chain Monte Carlo approaches are developed, allowing for more efficient convergence and mixing on the high-dimensional parameter space. The second part of the thesis presents a directed acyclic graph (DAG)-based approach for representing a collection of sampled alignments. This DAG representation allows the initial collection of samples to be used to generate a larger set of alignments under the same approximate distribution, enabling posterior alignment probabilities to be estimated reliably from a reasonable number of samples. If desired, summary alignments can then be generated as maximum-weight paths through the DAG, under various types of loss or scoring functions. The acyclic nature of the graph also permits various other types of algorithms to be easily adapted to operate on the entire set of alignments in the DAG. In the final part of this work, methodology is introduced for alignment-DAG-based sequence annotation using hidden Markov models, and RNA secondary structure prediction using stochastic context-free grammars. Results on test datasets indicate that the additional information contained within the DAG allows for improved predictions, resulting in substantial gains over simply analysing a set of alignments one by one.
Supervisor: Hein, Jotun Sponsor: EPSRC
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Bioinformatics (life sciences) ; Structural genomics ; Mathematical genetics and bioinformatics (statistics) ; Stochastic processes ; Probability ; multiple sequence alignment ; structural alignment ; alignment uncertainty ; statistical alignment ; Bayesian phylogenetics ; Markov chain Monte Carlo ; globin evolution ; graphical models ; directed acyclic graphs ; stochastic context-free grammars ; hidden Markov models ; sequence annotation ; RNA secondary structure