Autor: |
Brown NK, Shivakumar VS, Langmead B |
Jazyk: |
angličtina |
Zdroj: |
BioRxiv : the preprint server for biology [bioRxiv] 2024 Nov 02. Date of Electronic Publication: 2024 Nov 02. |
DOI: |
10.1101/2024.10.29.620953 |
Abstrakt: |
Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an ℴ( r + n/d )-space index for a collection of d sequences having a length- n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignmentbased methods. |
Databáze: |
MEDLINE |
Externí odkaz: |
|