Popis: |
Genome sequencing is revolutionising infectious disease epidemiology, providing a huge step forward in sensitivity and specificity over more traditional molecular typing techniques. However, the complexity of genome data often means that its analysis and interpretation require high-performance compute infrastructure and dedicated bioinformatics support. Furthermore, current methods have limitations that can differ between analyses and are often opaque to the user, and their reliance on multiple external dependencies makes reproducibility difficult. Here I introduce SKA, a toolkit for analysis of genome sequence data from closely related, small, haploid genomes. SKA uses split kmers to rapidly identify variation between genome sequences, making it possible to analyse hundreds of genomes on a standard home computer. Tests on publicly available simulated and real-life data show that SKA is both faster and more efficient than the gold-standard methods used today while retaining similar levels of accuracy for epidemiological purposes. SKA can take raw read data or genome assemblies as input and can calculate pairwise distances, create single-linkage clusters, and align genomes either to a reference genome or using a reference-free approach. SKA requires few decisions to be made by the user, which, along with its computational efficiency, makes genome analysis accessible to those with only basic bioinformatics training. The limitations of SKA are also far more transparent than those of current approaches, and future improvements to mitigate these limitations are possible. Overall, SKA is a powerful addition to the armoury of the genomic epidemiologist. SKA source code is available from GitHub (https://github.com/simonrharris/SKA). |