Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections

Author:	Xu Guo, Jeffrey Scott Vitter, Xiaoyang Chen, Hongwei Huo
Publication year:	2020
Subject:	Sequence; Current (mathematics); Applied Mathematics; Search engine indexing; Repetitive Sequences; Computational Biology; Sequence Analysis, DNA; Data Compression; Substring; Prime (order theory); Combinatorics; Compression (functional analysis); Genetics; Repeated sequence; Algorithms; Biotechnology; Mathematics; Repetitive Sequences, Nucleic Acid
Source:	IEEE/ACM Transactions on Computational Biology and Bioinformatics. 18(6)
ISSN:	1557-9964
Description:	In this paper, we focus on the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection $G$ of $t$ sequences $S_i$ of length $n$ each, we can represent $G$ succinctly in $2n\mathcal{H}_k(T) + O(n^\prime \log\log n) + o(qn^\prime) + o(tn)$ bits using $O(tn^2 + qn^\prime)$ time, where $\mathcal{H}_k(T)$ is the $k$th-order empirical entropy of the sequence $T \in G$ that is used as the reference sequence, $n^\prime$ is the total number of variations between $T$ and the sequences in $G$, and $q$ is a small fixed constant. We can restore any substring $S[sp, \dots, sp + len - 1]$ of length $len$ of $S \in G$ in $O(n_{s}^{\prime} + len \cdot (\log n)^2/\log\log n)$ time and report all positions where a pattern $P$ occurs in $G$ in $O(m \cdot t + occ \cdot t \cdot (\log n)^2/\log\log n)$ time. In addition, we propose a dynamic programming method to find the variations between $T$ and the sequences in $G$ in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d3e1ec7705e5d4e3e062cc5ea58688ee https://pubmed.ncbi.nlm.nih.gov/31985436