CS2A: A Compressed Suffix Array-Based Method for Short Read Alignment
Author: | Xinkun Wang, Hongwei Huo, Jun Huan, Zhigang Sun, Shuangjiang Li, Qiang Yu, Jeffrey Scott Vitter |
---|---|
Year of publication: | 2016 |
Subject: | 0301 basic medicine; Compressed suffix array; education.field_of_study; Source code; Theoretical computer science; Computer science; media_common.quotation_subject; Population; Approximation algorithm; Genomics; DNA sequencing; 03 medical and health sciences; 030104 developmental biology; Entropy (information theory); Pattern matching; education; Algorithm; media_common |
Source: | DCC |
DOI: | 10.1109/dcc.2016.58 |
Description: | Next-generation sequencing technologies generate an enormous number of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, it remains difficult to align all reads correctly. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space usage achieves the high-order empirical entropy of the input string. Unlike BWA, which uses two bits to represent a nucleotide and is therefore suited to constant-sized alphabets, our encoding scheme can be applied to strings over any alphabet. In addition, we present approximate pattern matching on the compressed suffix array (CSA) for short read alignment. CS2A supports both mismatch and gapped alignments for single-end and paired-end read mapping, and can efficiently align short sequencing reads to genome sequences. The experimental results show that CS2A is competitive with popular aligners in memory usage and mapping accuracy. The source code is available online. |
Database: | OpenAIRE |