Matching with mismatches and assorted applications
This thesis consists of three parts, each of independent interest, yet tied together by the problem of matching with mismatches. In the first chapter, we present a motivated exposition of a new randomized algorithm for indexed matching with mismatches which, for constant error (substitution) rates, locates a substring of length m within a string of length n faster than existing algorithms by a factor of O(m/ log(n)). The second chapter turns from this theoretical problem to an entirely practical concern: delta compression of executable code. In contrast to earlier work which has either generated very large deltas when applied to executable code, or has generated small deltas by utilizing platform and processor-specific knowledge, we present a naïve approach — that is, one which does not rely upon any external knowledge — which nevertheless constructs deltas of size comparable to those produced by a platformspecific approach. In the course of this construction, we utilize the result from the first chapter, although it is of primary utility only when producing deltas between very similar executables. The third chapter lies between the horn and ivory gates, being both highly interesting from a theoretical viewpoint and of great practical value. Using the algorithm for matching with mismatches from the first chapter, combined with error correcting codes, we give a practical algorithm for “universal” delta compression (often called “feedback-free file synchronization”) which can operate in the presence of multiple indels and a large number of substitutions.