Feasibility of N-Gram Data-Structures for Next-Generation Pathogen Signature Design

Author:	S N Gardner
Year of publication:	2009
Subject:	Set (abstract data type); Sequence; Sequence database; Computer science; String (computer science); Hash function; Data mining; Data structure; Algorithm; Signature (logic); Term (time)
DOI:	10.2172/947229
Description:	We determined the most appropriate data structure for storing and comparing n-grams (also known as k-mers) in genomic sequence data, one that scales in both memory and speed. This is critical to maintaining LLNL as the leader in pathogen detection, because it will guide the design of the 'Next Generation' system for computational signature prediction. We investigated two parts of k-mer analysis for signature prediction. The first is the enumeration and frequency counting of all observed k-mers in a sequence database (k-mer is the biological term equivalent to the computer science term n-gram). The second is the down-selection and pairing of k-mers to generate a signature. For the first part, we determined that suffix arrays are the preferred method for enumerating k-mers, being memory efficient and relatively easy and fast to compute. For the second part, a subset of the k-mers can be stored and manipulated in a hash, with the subset chosen by desired frequency characteristics, such as the most or least frequent k-mers in a set, k-mers shared among sequence sets, or k-mers that discriminate across sequence sets.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f17c5c218fc7463c5fb9dd40e2ef4e61 https://doi.org/10.2172/947229
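
The two-part approach summarized in the record above can be illustrated with a minimal Python sketch. It is not the report's implementation. The function names (`build_suffix_array`, `count_kmers`, `select_discriminating`) and the example sequences are hypothetical, and the suffix array is built by naive sorting, which a production system would replace with a dedicated construction algorithm. Each sequence set is assumed to be a single string, so k-mers spanning sequence boundaries are not handled.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction by sorting suffixes directly; fine for a demo,
    too slow and memory hungry for real genome-scale input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_kmers(text, k):
    """Part one: enumerate and count every k-mer via the suffix array.

    Suffixes sharing the same first k characters are adjacent in the
    sorted order, so a single linear scan yields the counts.
    """
    counts = []
    previous = None
    for start in build_suffix_array(text):
        kmer = text[start:start + k]
        if len(kmer) < k:
            continue  # suffix too short to contain a full k-mer
        if kmer == previous:
            counts[-1][1] += 1
        else:
            counts.append([kmer, 1])
            previous = kmer
    return [(kmer, n) for kmer, n in counts]


def select_discriminating(target_counts, background_counts, min_count=1):
    """Part two: keep a subset of k-mers in a hash (a Python dict).

    Here the subset is the k-mers seen at least min_count times in the
    target set and never in the background set.
    """
    background = {kmer for kmer, _ in background_counts}
    return {
        kmer: n
        for kmer, n in target_counts
        if n >= min_count and kmer not in background
    }


if __name__ == "__main__":
    target = count_kmers("ACGTACGT", 3)      # hypothetical target sequence
    background = count_kmers("GTACCC", 3)    # hypothetical background sequence
    print(target)
    print(select_discriminating(target, background))
```

The same dict-based selection could be adapted to the other criteria named in the description, such as keeping only the most or least frequent k-mers, or the k-mers shared among several sets.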