Sequence distance embeddings
Sequences represent a large class of fundamental objects in Computer Science sets, strings, vectors and permutations are considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems including: Fast computation of short sketches in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences. Approximate nearest neighbor and clustering problems, significantly faster than the naive exact solutions. Algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time. Efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication. Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.