A k-mer query tool for assessing population diversity in pangenomes

Authors: Ziwei Chen, Leonard McMillan, Fernando Pardo-Manuel de Villena, Martin T. Ferris, Maya L. Najarian, Hang Su
Publication year: 2021
Subject:
Source: BCB
DOI: 10.1145/3459930.3469537
Description: Inexpensive and fast genome sequencing has yielded multiple genome assemblies that, taken together, can be considered as a single pangenome model. However, applying conventional alignment-based sequence analysis to the assemblies of a pangenome is computationally expensive and largely redundant. Here, we present an alignment-free method that analyzes the relationship of any new sample relative to a given pangenome model using selected k-mer queries. We select a representative set of k-mers from the pangenome as probes and determine their frequencies in the raw short-read sequence data. The selection of probes is designed to cover every base of the pangenome, maximize sharing, and identify informative probes that discriminate between haplotypes. The k-mer frequencies are determined using an FM-index built over the raw sequence data of the new sample. Prior to the k-mer search, the probes are reordered to maximize the shared suffixes between successive k-mers, thus reducing the overall run time compared to executing each search independently. We aggregate the forward and reverse k-mer probe counts, save them in the appropriate rows of a count matrix, and remap them back to their locations in the pangenome. The resulting probe database serves as a valuable resource for representing population-scale sequence variations based on the pangenome model.
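The counting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain in-memory k-mer table stands in for the FM-index, sorting probes by their reversed string is only one simple way to place probes with shared suffixes next to each other, and all function names (`reverse_complement`, `order_by_shared_suffix`, `probe_counts`) are hypothetical.

```python
from collections import Counter

# Assumption: reads and probes use only the uppercase DNA alphabet A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def order_by_shared_suffix(probes):
    """Sort probes by their reversed string so that probes sharing a
    long suffix end up adjacent (a simple stand-in for the reordering
    step; the paper's exact ordering is not specified here)."""
    return sorted(probes, key=lambda p: p[::-1])


def count_kmers(reads, k):
    """Count every k-mer in the reads. A naive substitute for
    querying an FM-index built over the raw reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def probe_counts(probes, reads):
    """Return {probe: forward count + reverse-complement count}."""
    k = len(probes[0])
    table = count_kmers(reads, k)
    result = {}
    for probe in order_by_shared_suffix(probes):
        rc = reverse_complement(probe)
        forward = table[probe]
        # A probe equal to its own reverse complement is counted once.
        reverse = table[rc] if rc != probe else 0
        result[probe] = forward + reverse
    return result


reads = ["ACGTAC", "GTACGG"]
probes = ["ACG", "GTA", "CCG"]
counts = probe_counts(probes, reads)
```

In a real pipeline each aggregated count would be written to the probe's row of the count matrix, with one column per sample; the dictionary here plays that role for a single sample.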
Database: OpenAIRE