Description: |
Inexpensive and fast genome sequencing has yielded multiple genome assemblies that, taken together, can be considered as a single pangenome model. However, applying conventional alignment-based sequence analysis to the assemblies of a pangenome is computationally expensive and largely redundant. Here, we present an alignment-free method that analyzes the relationship of any new sample to a given pangenome model using selected k-mer queries. We select a representative set of k-mers from the pangenome as probes and determine their frequencies in the raw short-read sequence data. The selection of probes is designed to cover every base of the pangenome, maximize sharing, and identify informative probes that discriminate between haplotypes. The k-mer frequencies are determined using an FM-index built over the raw sequence data of the new sample. Prior to the k-mer search, the probes are reordered to maximize the shared suffixes between successive k-mers, which reduces the overall run time compared to executing each search independently. We aggregate the forward and reverse k-mer probe counts, save them in the appropriate rows of a count matrix, and remap them back to their locations in the pangenome. The resulting probe database serves as a valuable resource for representing population-scale sequence variations based on the pangenome model. |
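
The counting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy FM-index built by naive suffix sorting (suitable only for tiny inputs), reads joined with a `$` separator, and probes ordered by sorting their reversed strings as one simple way to place probes with long shared suffixes next to each other. All function names (`build_fm_index`, `count_probes`, `probe_counts`) are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table) by naive suffix sorting."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the final '$'
    alphabet = sorted(set(text))
    freq = Counter(text)
    C, total = {}, 0  # C[c] = number of characters in text smaller than c
    for c in alphabet:
        C[c] = total
        total += freq[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_probes(probes, C, occ, n):
    """Count probes by backward search, reusing intervals of shared suffixes."""
    # Assumption: sorting by the reversed string groups probes with common suffixes.
    order = sorted(set(probes), key=lambda p: p[::-1])
    counts = {}
    prev = ""
    stack = [(0, n)]  # stack[j] = interval for the last j characters of prev
    for p in order:
        shared = 0
        while (shared < min(len(p), len(prev))
               and p[-1 - shared] == prev[-1 - shared]):
            shared += 1
        del stack[shared + 1:]  # keep only the intervals of the shared suffix
        lo, hi = stack[-1]
        # Extend the match leftwards over the characters not shared with prev.
        for ch in reversed(p[:len(p) - shared]):
            if ch in C:
                lo, hi = C[ch] + occ[ch][lo], C[ch] + occ[ch][hi]
            else:  # character absent from the reads (e.g. 'N'): no match
                lo = hi = 0
            stack.append((lo, hi))
        counts[p] = hi - lo
        prev = p
    return counts


def probe_counts(probes, reads):
    """Aggregate forward and reverse-complement counts for each probe."""
    C, occ, n = build_fm_index("$".join(reads) + "$")
    hits = count_probes(probes + [revcomp(p) for p in probes], C, occ, n)
    result = {}
    for p in probes:
        rc = revcomp(p)
        # A probe equal to its own reverse complement is counted only once.
        result[p] = hits[p] + (hits[rc] if rc != p else 0)
    return result


if __name__ == "__main__":
    print(probe_counts(["ACG", "TAC"], ["ACGTAC", "GTACGT"]))
```

In this sketch each probe only pays for the characters that differ from the previous probe's suffix, which is the source of the run-time saving over independent searches. A production index would replace the naive suffix sort and the dense occurrence table with compressed structures.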