Additional file 1 of MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Authors: Hita, Andrea, Brocart, Gilles, Fernandez, Ana, Rehmsmeier, Marc, Alemany, Anna, Schvartzman, Sol
Year of publication: 2022
DOI: 10.6084/m9.figshare.18461873
Description: Additional file 1. Supplementary figures. a–d Frequency of annotated transcriptomic features overlapping in genomic origin, by biotype and organism, for Human (a), Arabidopsis (b), Mouse (c) and Nematode (d). The dotplot presents combinations of two (blue) or three (purple) overlapping features of different biotypes whose occurrence exceeds 5% of the total number of features from the least abundant biotype in the combination. The top barplot shows the log10 of the total number of cases per combination. The right barplot shows the relative proportion of features overlapping with any other feature, by biotype. e Proportion of reads from a human total RNA-seq library overlapping 1, 2 or more annotations according to raw alignment assignment (left) and after MGcount assignment by hierarchical rounds (right). f Proportion of reads multi-mapping to a given number of genomic locations (up to 50), according to raw alignment assignment (left) and after MG community aggregation by MGcount (right) (HBR: Human Brain; sNC: small non-coding; lNC: long non-coding). g Comparison of counts when only uniquely mapping reads are counted, when all alignments are counted, when all alignments are fractionally counted as 1 divided by the number of multi-assignments, or when reads are quantified with MGcount. h–j Three random sub-graphs of 500 features after aggregation, extracted from the long-RNA graph. Each vertex is an annotated feature, and its size is proportional to its number of aligned reads. Vertices are colored blue for protein-coding genes, yellow for pseudogenes and pink for other lncRNA transcripts. Each edge connects two features with shared multi-mappers, with thickness proportional to the fraction of shared multi-mappers over the total alignments. Shared grey areas delineate MG communities. k Comparison of the number of features detected by biotype with a mean count of 5 over human brain replicates. Intronic counts for MGcount are not considered.
In addition to their approaches to quantifying ambiguous alignments, the software tools differ in their assignment criteria: RSEM uses a probabilistic criterion; featureCounts and MGcount were configured with the same criterion, defined as a full overlap between all the nucleotides of a read and the annotation; Coco and mmquant require a minimum number of overlapping nucleotides for assignment, which was set according to default parameters (10 nt for Coco and 1 nt for mmquant). The comparison is made at community level, where mmquant merged genes are annotated here as mmquant communities, and at gene level, where for community-based algorithms a gene is considered detected if it belongs to a detected community. MGcount results in a lower number of features, since each community is quantified as a single feature. In addition, compared with other methods, the community approach allows a more inclusive quantification of the individual features collapsed into communities and makes it easier to trace multi-mapping reads back when needed. Incorporating annotations from multiple sources in the custom GTF allows more transcripts and biotypes to be detected. Mmquant quantification leads to a very large number of communities, since the same gene can be part of multiple merged genes, which can complicate differential expression analysis, unlike with MGcount.
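The three baseline counting strategies compared in panel g (uniquely mapping reads only, all alignments, and fractional counting as 1 divided by the number of multi-assignments) can be illustrated with a minimal Python sketch. The function name `count_reads`, the input layout (a dictionary from read ID to the list of features its alignments were assigned to) and the toy read and gene names are assumptions made for illustration; this is not MGcount's community-based method, nor the implementation of any of the tools named above.

```python
from collections import defaultdict


def count_reads(assignments, strategy):
    """Count reads per feature under one of three simple strategies.

    assignments: dict mapping a read ID to the list of feature IDs that the
                 read's alignments were assigned to (hypothetical layout).
    strategy:    "unique", "all" or "fractional".
    """
    counts = defaultdict(float)
    for features in assignments.values():
        n = len(features)
        if n == 0:
            continue  # unassigned read
        if strategy == "unique":
            # Only reads assigned to exactly one feature contribute.
            if n == 1:
                counts[features[0]] += 1.0
        elif strategy == "all":
            # Every assignment contributes a full count.
            for feature in features:
                counts[feature] += 1.0
        elif strategy == "fractional":
            # Every assignment contributes 1 / (number of multi-assignments).
            for feature in features:
                counts[feature] += 1.0 / n
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return dict(counts)


# Toy data (invented for illustration only).
toy = {
    "read1": ["geneA"],
    "read2": ["geneA", "geneB"],
    "read3": ["geneB", "geneC"],
}

unique_counts = count_reads(toy, "unique")
all_counts = count_reads(toy, "all")
fractional_counts = count_reads(toy, "fractional")
```

In this toy case, unique counting discards two of the three reads, counting all alignments inflates the total to five counts for three reads, and fractional counting preserves the total of three while splitting the ambiguous reads evenly.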
Database: OpenAIRE