Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.769972
Title: Algorithmic advances in handling uncertainty & regularity in strings
Author: Kundu, Ritu
ISNI: 0000 0004 7660 3780
Awarding Body: King's College London
Current Institution: King's College London (University of London)
Date of Award: 2019
Abstract:
Genomics, owing to its immediate applications in medicine, forensics, evolutionary and molecular biology etc., has witnessed a dramatic advancement in the technology of acquiring and generating data. Consequently, the bottleneck of the information-extraction pipeline has shifted from data-acquisition to the computational capacity for storing and processing prodigious amounts of data. Uncertainty and identification of regularity in data are two key aspects contributing to the complexity of the task of mining knowledge and insights from genomic data. One form of macro-level uncertainty arises in sequential data when a single representation is used for a multitude of strings which are by and large similar. For example, in human genomics, the reference genome has been represented as a single sequence so far. Now, with the availability of a vast collection of human genomes, the so-called reference cohorts seem more sensible in order to avoid the reference-bias presented by a single genomic sequence. Different representations have recently been explored in an attempt to organise human genomic sequences in reference cohorts. Each such representation has its own challenges. Moreover, in genomic sequences, local regularity (a term encapsulating various forms of repetitions) is often flanked by regions of interest - genes, for example - which are, in comparison, not regular. In other words, the regularity of a local segment of genomic data is indicative of it being a potential biologically-important region. One of the multiple possible ways to express this notion of local regularity of strings is in terms of unbordered factors of a string. A border of a string - one of the central properties characterising the regularity associated with repetitions - is its (possibly empty) proper factor occurring both as a prefix and as a suffix.
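The definition of a border given above can be illustrated with a short sketch. It is not taken from the thesis: it uses the standard prefix function (the failure function from Knuth-Morris-Pratt matching) to list the border lengths of a string, and the function names are illustrative. It also assumes that "unbordered" means the string has no border other than the empty one.

```python
def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest non-trivial border of s[:i+1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        # Fall back through shorter borders until one can be extended.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def border_lengths(s: str) -> list[int]:
    """Lengths of all borders of s, longest first, ending with the empty border (0)."""
    if not s:
        return []  # the empty string has no proper factor
    pi = prefix_function(s)
    lengths = []
    k = pi[-1]
    while k > 0:
        lengths.append(k)
        k = pi[k - 1]  # the next shorter border is a border of the current one
    lengths.append(0)
    return lengths


def is_unbordered(s: str) -> bool:
    """Assumption: a non-empty string is unbordered if its only border is empty."""
    return bool(s) and prefix_function(s)[-1] == 0
```

For instance, "abaab" has the border "ab" (and the empty border), whereas "abc" has only the empty border and is therefore unbordered under the assumption above.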
This dissertation presents an assortment of efficient novel algorithms - based on string algorithms and data-structures - to solve three problems that find direct or indirect applications in genomic data analysis. Specifically, the presented algorithms handle the uncertainty arising in the representation of an ensemble of sequences as well as characterise the regularity present in a sequence in terms of unbordered factors. Firstly, we present an optimal algorithm - in terms of both time and space - improving the state-of-the-art, to identify Superbubbles (a special type of self-contained subgraphs, each with a single source and a single sink) in de Bruijn sequence graphs for genome assembly. Identifying these motifs in a reference graph is crucial for overcoming the lack of a coordinate system in the graphical representation of a reference cohort. Secondly, we introduce another representation for sequential data with macro-level uncertainty, called Elastic-degenerate strings. The motivation is to condense a set of genomes (with variations) as a reference cohort. An elastic-degenerate string is a string in which an elastic-degenerate symbol can occur at one or more positions; each such symbol corresponds to a set of two or more variable-length strings. We not only formalise the concept of elastic-degenerate strings but also present a practically efficient algorithm to solve the pattern matching problem in a given elastic-degenerate text. Lastly, we provide a quasilinear time algorithm to compute the Longest Unbordered Factor Array of a string w for general alphabets. This array specifies the length of the maximal unbordered factor (the longest factor which does not have a border) starting at each position of w. This is a major improvement on the running time of the best previously known worst-case algorithm, which works in O(n^1.5) time for integer alphabets, where n is the length of w.
Although this problem is rooted in theory, the data-structures proposed in this algorithm can be used to characterise the regularity of a sequence; this has possible applications in genomics.
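To make the Longest Unbordered Factor Array concrete, here is a minimal quadratic-time baseline. It is only a sketch of what the array contains, not the quasilinear algorithm of the thesis; the function names are illustrative, and "unbordered" is assumed to mean that a factor has no non-empty border.

```python
def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest non-trivial border of s[:i+1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def longest_unbordered_factor_array(w: str) -> list[int]:
    """luf[i] = length of the longest unbordered factor of w starting at position i.

    Naive O(n^2) baseline: for each start i, compute the prefix function of
    the suffix w[i:]; the factor w[i:i+length] is unbordered exactly when
    pi[length - 1] == 0.
    """
    luf = []
    for i in range(len(w)):
        pi = prefix_function(w[i:])
        best = 1  # a single letter is always unbordered
        for length in range(1, len(pi) + 1):
            if pi[length - 1] == 0:
                best = length
        luf.append(best)
    return luf


if __name__ == "__main__":
    print(longest_unbordered_factor_array("abaab"))
```

For w = "abaab" the longest unbordered factors starting at positions 0 to 4 are "ab", "baa", "aab", "ab" and "b", so the array is [2, 3, 3, 2, 1].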
Supervisor: Pissis, Solon
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.769972
DOI: Not available