Shannon: An Information-Optimal de Novo RNA-Seq Assembler
Autor: | Lior Pachter, Kayvon Mazooji, Sreeram Kannan, David Tse, Joseph Hui |
---|---|
Rok vydání: | 2016 |
Předmět: |
0303 health sciences
Sequence Theoretical computer science Computer science 030302 biochemistry & molecular biology Graph partition Sequence assembly Genomics Information theory Quantitative Biology::Genomics De Bruijn graph Graph 03 medical and health sciences symbols.namesake symbols Key (cryptography) Graph (abstract data type) Algorithm 030304 developmental biology |
Popis: | De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. We present Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory we develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear-time even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers. |
Databáze: | OpenAIRE |
Externí odkaz: |