Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing

Authors: Jason Fan, Jamshed Khan, Giulio Ermanno Pibiri, Rob Patro
Language: English
Year of publication: 2023
Subject:
Source: RECOMB 2023: 27th International Conference on Research in Computational Molecular Biology, pp. 21–40, Istanbul, Turkey, 16-19 April 2023
Lecture Notes in Computer Science ISBN: 9783031291180
Description: The reference indexing problem for $k$-mers is to pre-process a collection of reference genomic sequences $\mathcal{R}$ so that the positions of all occurrences of any queried $k$-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of $\mathcal{R}$ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in $\mathcal{R}$. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for $k$-mers into: (1) a $k$-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced methods for constructing and compactly indexing $k$-mer sets can be used to efficiently implement the $k$-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the $k$-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique $k$-mers in $\mathcal{R}$. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6$\times$ slowdown when querying $k$-mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
Database: OpenAIRE
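
Illustration: the two-level decomposition described in the abstract (a $k$-mer-to-tile mapping followed by a tile-to-occurrence mapping) can be sketched in a few lines of Python. This is only a toy sketch under stated assumptions, not the pufferfish2 implementation: the references, tiles, tiling, and every function name below are hypothetical; plain dictionaries stand in for the compact data structures a real index would use; reverse complements are ignored; and each $k$-mer is assumed to appear exactly once across all tiles.

```python
from collections import defaultdict

K = 3  # k-mer length for this toy example (assumption)

# Hypothetical toy data: two references and two tiles that spell them out.
REFERENCES = ["ACGTAC", "GTACGT"]
TILES = ["ACGT", "GTAC"]
# For each reference, the (tile_id, start_position) pairs in order.
# Consecutive tiles overlap by K - 1 characters.
TILING = [
    [(0, 0), (1, 2)],  # reference 0 = tile 0 at 0, then tile 1 at 2
    [(1, 0), (0, 2)],  # reference 1 = tile 1 at 0, then tile 0 at 2
]


def build_kmer_to_tile(tiles, k):
    """Mapping (1): k-mer -> (tile_id, offset of the k-mer inside the tile)."""
    table = {}
    for tile_id, tile in enumerate(tiles):
        for offset in range(len(tile) - k + 1):
            kmer = tile[offset:offset + k]
            # Assumption of this sketch: a k-mer occurs once across all tiles.
            assert kmer not in table, f"k-mer {kmer} is not unique in the tiles"
            table[kmer] = (tile_id, offset)
    return table


def build_tile_to_occurrences(tiling, tiles, references):
    """Mapping (2): tile_id -> list of (reference_id, start_position)."""
    occurrences = defaultdict(list)
    for ref_id, placements in enumerate(tiling):
        for tile_id, start in placements:
            tile = tiles[tile_id]
            # Sanity check: the tile really occurs where the tiling says.
            assert references[ref_id][start:start + len(tile)] == tile
            occurrences[tile_id].append((ref_id, start))
    return dict(occurrences)


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Return all (reference_id, position) pairs where the k-mer occurs."""
    hit = kmer_to_tile.get(kmer)
    if hit is None:
        return []  # k-mer is absent from the indexed references
    tile_id, offset = hit
    # Shift every occurrence of the tile by the k-mer's offset in the tile.
    return sorted(
        (ref_id, start + offset)
        for ref_id, start in tile_to_occ.get(tile_id, [])
    )


kmer_to_tile = build_kmer_to_tile(TILES, K)
tile_to_occ = build_tile_to_occurrences(TILING, TILES, REFERENCES)
print(locate("CGT", kmer_to_tile, tile_to_occ))  # [(0, 1), (1, 3)]
```

In this sketch the first table has one entry per distinct $k$-mer, while the second has one entry per tile occurrence, which is why the second grows with the total amount of indexed sequence, as the abstract notes.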