Description: |
Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern (VoCs). Phylogenetic methods provide the gold standard for representing the global diversity of SARS-CoV-2 and for identifying newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These issues will only intensify as the already vast number of available SARS-CoV-2 genomes continues to grow. It will therefore be beneficial to develop complementary methods that can incorporate all of the available genetic data, without downsampling, to extract meaningful information rapidly and with minimal curation. Here, we demonstrate the utility of algorithmic approaches based on word statistics for representing whole sequences. These approaches bring speed, scalability, and interpretability to the construction of genetic topologies and can be used to augment traditional classification practice based on phylogeny. |
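The abstract does not specify which word statistics are used, so the following is only a generic illustration of the idea, not the authors' method. It assumes that "words" are overlapping k-mers over the alphabet A, C, G, T, that a sequence is represented by its vector of relative k-mer frequencies, and that two sequences are compared by Euclidean distance. The function names `kmer_profile` and `euclidean`, the choice of `k`, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=3):
    """Return relative frequencies of all 4**k DNA words, in a fixed order.

    Assumption: words containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed word order so that profiles of different sequences are comparable.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy sequences (hypothetical, not real SARS-CoV-2 data).
a = kmer_profile("ACGTACGTACGT", k=2)
b = kmer_profile("ACGTACGAACGT", k=2)
print(euclidean(a, b))
```

Because each genome is reduced to a fixed-length vector in a single pass, the cost of building a representation grows linearly with sequence length and needs no alignment, which is the kind of scalability the abstract refers to.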