Title:

Algorithmic advances in handling uncertainty & regularity in strings

Genomics, owing to its immediate applications in medicine, forensics, and evolutionary and molecular biology, has witnessed dramatic advances in the technology for acquiring and generating data. Consequently, the bottleneck of the information-extraction pipeline has shifted from data acquisition to the computational capacity for storing and processing prodigious amounts of data. Uncertainty and the identification of regularity in data are two key aspects contributing to the complexity of mining knowledge and insights from genomic data. One form of macro-level uncertainty arises in sequential data when a single representation is used for a multitude of strings that are by and large similar. For example, in human genomics, the reference genome has so far been represented as a single sequence. Now, with the availability of a vast collection of human genomes, so-called reference cohorts seem more sensible, as they avoid the reference bias introduced by a single genomic sequence. Different representations have recently been explored in an attempt to organise human genomic sequences in reference cohorts. Each such representation has its own challenges. Moreover, in genomic sequences, local regularity (a term encapsulating various forms of repetitions) is often flanked by regions of interest (genes, for example) which are, in comparison, not regular. In other words, the regularity of a local segment of genomic data is indicative of a potentially biologically important region. One of several possible ways to express this notion of local regularity of strings is in terms of unbordered factors of a string. A border of a string, one of the central properties characterising the regularity associated with repetitions, is a (possibly empty) proper factor occurring both as a prefix and as a suffix.
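To make the notion of a border concrete, the following minimal Python sketch computes the length of the longest border of a string using the classical prefix (failure) function. It is an illustration only, not one of the algorithms of this dissertation; the function names `longest_border` and `is_unbordered` are hypothetical, and the sketch assumes the usual convention that a string is unbordered when its only border is the empty one.

```python
def longest_border(s: str) -> int:
    """Return the length of the longest border of s (0 if only the empty border exists)."""
    n = len(s)
    if n == 0:
        return 0
    # fail[i] = length of the longest proper prefix of s[:i+1] that is also its suffix
    fail = [0] * n
    k = 0
    for i in range(1, n):
        # Fall back through shorter borders until s[i] can extend one
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]


def is_unbordered(s: str) -> bool:
    """Assumed convention: unbordered means no non-empty border."""
    return longest_border(s) == 0
```

For instance, "abaab" has the border "ab" (a prefix and a suffix), whereas "aab" has only the empty border and is therefore unbordered under the convention assumed here.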
This dissertation presents an assortment of efficient novel algorithms, based on string algorithms and data structures, to solve three problems that find direct or indirect applications in genomic data analysis. Specifically, the presented algorithms handle the uncertainty arising in the representation of an ensemble of sequences, and characterise the regularity present in a sequence in terms of unbordered factors. Firstly, we present an algorithm that is optimal in both time and space, improving on the state of the art, to identify superbubbles (a special type of self-contained subgraph, each with a single source and a single sink) in de Bruijn sequence graphs for genome assembly. Identifying these motifs in a reference graph is crucial for overcoming the lack of a coordinate system in the graphical representation of a reference cohort. Secondly, we introduce another representation for sequential data with macro-level uncertainty, called elastic-degenerate strings. The motivation is to condense a set of genomes (with variations) as a reference cohort. An elastic-degenerate string is a string in which an elastic-degenerate symbol can occur at one or more positions; each such symbol corresponds to a set of two or more variable-length strings. We not only formalise the concept of elastic-degenerate strings but also present a practically efficient algorithm to solve the pattern matching problem in a given elastic-degenerate text. Lastly, we provide a quasilinear-time algorithm to compute the Longest Unbordered Factor Array of a string w for general alphabets. This array specifies the length of the maximal unbordered factor (the longest factor which does not have a border) starting at each position of w. This is a major improvement on the running time of the currently best worst-case algorithm, which works in O(n^1.5) time for integer alphabets, where n is the length of w.
Although this problem is rooted in theory, the data structures proposed in this algorithm can be used to characterise the regularity of a sequence; this has possible applications in genomics.
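As an illustration of the definition of the Longest Unbordered Factor Array, the sketch below computes it by brute force in Python. It is a quadratic-time baseline written for clarity, not the quasilinear-time algorithm of this dissertation; the function name `luf_array_naive` is hypothetical, and the sketch assumes that a factor is unbordered when it has no non-empty border.

```python
def luf_array_naive(w: str) -> list[int]:
    """Brute-force Longest Unbordered Factor Array of w in O(n^2) time.

    luf[i] is the length of the longest unbordered factor starting at position i.
    """
    n = len(w)
    luf = [0] * n
    for i in range(n):
        suffix = w[i:]
        m = len(suffix)
        # Prefix function of the suffix: fail[j] is the longest border length
        # of suffix[:j+1], so that prefix is unbordered exactly when fail[j] == 0.
        fail = [0] * m
        k = 0
        best = 1  # a single letter is always unbordered
        for j in range(1, m):
            while k > 0 and suffix[j] != suffix[k]:
                k = fail[k - 1]
            if suffix[j] == suffix[k]:
                k += 1
            fail[j] = k
            if k == 0:
                best = j + 1  # suffix[:j+1] has no non-empty border
        luf[i] = best
    return luf
```

For w = "abaab" this yields [2, 3, 3, 2, 1]: the longest unbordered factors starting at the successive positions are "ab", "baa", "aab", "ab", and "b".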
