Efficient reconciliation of genomic datasets of high similarity

Authors:	Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory
Year of publication:	2022
Subject:	Invertible Bloom Lookup Tables; sketching; Applied computing; MinHash; syncmers; minimizers; IBLT; k-mers
DOI:	10.1101/2022.06.07.495186
Description:	We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets than MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the size of the difference between the k-mer sets, independent of the sizes of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes). LIPIcs, Vol. 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022), pages 14:1-14:14
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::7147edcd63d24ba7d823f92966e4a9af https://doi.org/10.1101/2022.06.07.495186
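
The description above says that IBLTs can recover the k-mers that differ between two datasets using space proportional to the size of the difference. The following is a minimal sketch of that general idea, not the authors' implementation: it builds one IBLT per k-mer set, subtracts the tables cell by cell, and peels the result to list the differing k-mers. All names (`IBLT`, `kmer_keys`, `jaccard_from_diff`), the choice of BLAKE2b hashing, the three-partition layout, and the table size are assumptions made for illustration. The syncmer sampling step used in the paper is omitted here.

```python
import hashlib

def _hash64(key: int, salt: int) -> int:
    """64-bit hash of an integer key; `salt` selects an independent function."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8,
                             salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def kmer_keys(seq: str, k: int) -> set:
    """Set of k-mers of `seq`, each packed into an integer (2 bits per base, k <= 32)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    keys = set()
    for i in range(len(seq) - k + 1):
        key = 0
        for base in seq[i:i + k]:
            key = (key << 2) | code[base]
        keys.add(key)
    return keys

class IBLT:
    """Toy IBLT: each key maps to one cell in each of `parts` partitions."""

    def __init__(self, cells_per_part: int, parts: int = 3):
        self.w, self.r = cells_per_part, parts
        n = cells_per_part * parts
        self.count = [0] * n      # signed number of keys in the cell
        self.key_sum = [0] * n    # XOR of the keys in the cell
        self.chk_sum = [0] * n    # XOR of the key checksums in the cell

    def _cells(self, key: int) -> list:
        return [p * self.w + _hash64(key, p) % self.w for p in range(self.r)]

    def _toggle(self, key: int, delta: int) -> None:
        chk = _hash64(key, 255)   # checksum hash, distinct from the cell hashes
        for j in self._cells(key):
            self.count[j] += delta
            self.key_sum[j] ^= key
            self.chk_sum[j] ^= chk

    def insert(self, key: int) -> None:
        self._toggle(key, 1)

    def subtract(self, other: "IBLT") -> "IBLT":
        """Cell-wise difference; keys present in both tables cancel out."""
        diff = IBLT(self.w, self.r)
        diff.count = [a - b for a, b in zip(self.count, other.count)]
        diff.key_sum = [a ^ b for a, b in zip(self.key_sum, other.key_sum)]
        diff.chk_sum = [a ^ b for a, b in zip(self.chk_sum, other.chk_sum)]
        return diff

    def decode(self):
        """Peel pure cells; returns (keys only in A, keys only in B, success flag)."""
        only_a, only_b = set(), set()
        todo = list(range(len(self.count)))
        while todo:
            j = todo.pop()
            sign = self.count[j]
            # A cell is pure if it holds exactly one key and its checksum matches.
            if sign not in (1, -1) or _hash64(self.key_sum[j], 255) != self.chk_sum[j]:
                continue
            key = self.key_sum[j]
            (only_a if sign == 1 else only_b).add(key)
            self._toggle(key, -sign)          # remove the key from all its cells
            todo.extend(self._cells(key))     # those cells may have become pure
        empty = not (any(self.count) or any(self.key_sum) or any(self.chk_sum))
        return only_a, only_b, empty

def jaccard_from_diff(size_a: int, only_a: set, only_b: set) -> float:
    """Jaccard similarity from |A| and the two decoded one-sided differences."""
    return (size_a - len(only_a)) / (size_a + len(only_b))
```

In this sketch the table size depends only on how many k-mers are expected to differ, not on how many k-mers each dataset contains. Decoding is probabilistic: if the table is too small for the actual difference, peeling stalls and the success flag is false, in which case the decoded sets are incomplete.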