CS2A: A Compressed Suffix Array-Based Method for Short Read Alignment

Authors: Xinkun Wang, Hongwei Huo, Jun Huan, Zhigang Sun, Shuangjiang Li, Qiang Yu, Jeffrey Scott Vitter
Year of publication: 2016
Subject:
Source: DCC
DOI: 10.1109/dcc.2016.58
Description: Next-generation sequencing technologies generate an enormous number of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, it remains difficult to align all reads correctly. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space usage achieves the high-order empirical entropy of the input string. Unlike BWA, which uses two bits to represent each nucleotide and is therefore suited to constant-sized alphabets, our encoding scheme can be applied to strings over any alphabet. In addition, we present approximate pattern matching on the compressed suffix array (CSA) for short read alignment. CS2A supports both mismatch and gapped alignments for single-end and paired-end read mapping and can efficiently align short sequencing reads to genome sequences. The experimental results show that CS2A can compete with popular aligners in memory usage and mapping accuracy. The source code is available online.
Database: OpenAIRE