Authors: |
Xiaoming Xu, Zekun Yin, Lifeng Yan, Hao Zhang, Borui Xu, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu |
Year of publication: |
2022 |
DOI: |
10.1101/2022.10.13.512052 |
Description: |
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences (RefSeq: 455 GB in FASTA format) in under 6 minutes, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in 34 minutes, on a 128-core workstation. Our results further identify 1,269 repetitive genomes (genomes with identical nucleotide content) among the RefSeq bacterial genomes. |
Database: |
OpenAIRE |