Authors: |
Orakov, Askarbek; Fullam, Anthony; Coelho, Luis Pedro; Khedkar, Supriya; Szklarczyk, Damian; Mende, Daniel R.; Schmidt, Thomas S. B.; Bork, Peer |
Year of publication: |
2021 |
DOI: |
10.6084/m9.figshare.14776610.v1 |
Description: |
Additional file 1: Figure S1. a Percent stacked bar chart of CheckM-inferred marker lineage levels (colors) for type 3a simulated chimeric genomes (see Methods & Fig. 2a) across different: 1) divergence levels of source genomes (x-axis); 2) simulated portions of contamination (columns); and 3) scenarios of contamination (‘added’ vs ‘replaced’, rows; see Methods). In a and b, the first column (“0”) shows clean (non-chimeric) genomes for comparison. b Average inferred CheckM marker lineage depth (y-axis) of simulated chimeric genomes under different contamination scenarios (‘added’ in dark blue; ‘replaced’ in light blue). The true taxonomic depth of divergence between source genomes is indicated in green. c Equivalent to a, but using chimeric genomes simulated from multiple sources (type 3b in Fig. 2a). Columns indicate the number of equally contributing source genomes (n_sources); rows indicate simulation setups (‘0.5’ if 50% of each source genome was used; ‘1/n_sources’ for equal source parts; see Methods). In c and d, the first column (“1”) shows clean (non-chimeric) genomes and the second column (“2”) shows type 3a genomes as in a and b, both for comparison. d Average inferred CheckM marker lineage depths (y-axis) with different portions of contamination, equivalent to panel b. Figure S2. Comparison of median GUNC and CheckM scores for simulated genomes of types 3a and 3b in which source genomes make equal contributions summing to 1 in total (e.g. 0.2 from each of 5 sources or 0.25 from each of 4 sources). This shows that the trend from Fig. 2b persists when multiple source genomes are mixed in a simulated chimeric genome. Figure S3. F-scores for distinguishing clean from chimeric genomes across all divergence levels of source genomes for different simulation scenarios. MIMAG medium is defined as CheckM contamination <10% and CheckM completeness ≥50%.
MIMAG high is defined as CheckM contamination <5% and CheckM completeness >90%; the additional MIMAG criteria on the presence of rRNAs and tRNAs were ignored here because they are irrelevant to our simulations. “Cont” stands for CheckM contamination and GUNC means GUNC CSS of 0.45 & contamination >2%). The stacked bar plot on the right indicates the number of genomes in the overlap for each category. Because these categories overlap, genomes in each category were counted and then removed from the set used to count the remaining categories, in the following order of genome counts: 71 > 34 > 86 > 187 & 29. Figure S8. a Cumulative plot summarizing the genome quality of various sets of genomes, represented by lines of different colors. Each point on a line indicates the proportion of genomes retained in a set (y-axis) after filtering out genomes with GUNC CSS higher than the cutoff (x-axis) & GUNC contamination >5% (ignoring species-level scores). b Cumulative plot illustrating the number of species-level genome bins (SGBs) (from Pasolli et al. 2019). Lines indicate the proportion of unique SGBs retained (y-axis) after filtering out SGBs where either “all” or “at least one” genome has a GUNC CSS higher than the cutoff (x-axis) & GUNC contamination >5%. Figure S9. Cumulative plots summarizing genome quality for various genome reference and MAG datasets. This plot is equivalent to main Fig. 3a, but uses a reference set based on GTDB v95 instead of GUNC’s default based on proGenomes 2.1 (see Methods for details). Note that the Almeida, Pasolli and Nayfach sets were pre-filtered using variations of the MIMAG medium criterion based on CheckM estimates. GTDB, Genome Taxonomy Database; GMGC, Global Microbial Gene Catalogue. Figure S10. Alluvial illustration of the fate of genomes in GMGC based on filters by GUNC and CheckM. Three filters are: 1) CheckM contamination |
Database: |
OpenAIRE |
External link: |
|