Popis: |
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment employed by modern sequence aligners. While effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear gap cost chaining and quadratic time gap extension is O(m n^{f(θ)} log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than a 1 − O(1/√m) fraction of the homologous bases are recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, i.e. only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced. |