Popis: |
Because bacterial strains can exhibit different biological properties, strain-level composition analysis plays a vital role in understanding the functions and dynamics of microbial communities. Metagenomic sequencing has become the major means for probing the microbial composition in host-associated or environmental samples. Despite a plethora of composition analysis tools, they are not optimized to address the challenges in strain-level analysis: a reference database with highly similar reference strain genomes and the presence of multiple strains under one species in a sample. In this work, we present a new strain-level composition analysis tool named StrainScan that employs a novel tree-based k-mer indexing structure to strike a balance between the strain identification accuracy and the computational complexity. We rigorously tested StrainScan on many simulated and real sequencing data and benchmarked StrainScan with popular strain-level analysis tools including Krakenuniq, StrainSeeker, Pathoscope2, Sigma, StrainGE, and Strainest. The results show that StrainScan has higher accuracy and resolution than the the state-of-the-art tools on strain-level composition analysis. It improves the F1-score by 20% in identifying multiple strains with at least 99.89% average nucleotide identity. StrainScan takes short reads and a set of reference strains as input and its source codes are freely available at https://github.com/liaoherui/strainScan. |