Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism
Authors: | Jonathan D. Magasin, Dietlind L. Gerloff |
---|---|
Year of publication: | 2014 |
Subject: |
Statistics and Probability; Sequence analysis; Archaeal Proteins; Datasets as Topic; Computational biology; Biology; computer.software_genre; Biochemistry; Genome; Chimerism; Annotation; Bacterial Proteins; Genome, Archaeal; Molecular Biology; Contig; Molecular Sequence Annotation; Sequence Analysis, DNA; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; Metagenomics; Pyrosequencing; Data mining; computer; Genome, Bacterial |
Source: | Bioinformatics (Oxford, England). 31(3) |
ISSN: | 1367-4811 |
Description: | Motivation: Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Results: Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing (‘454’) datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in ‘old’ data. Contact: dgerloff@ffame.org Supplementary information: Supplementary data are available at Bioinformatics online. |
Database: | OpenAIRE |