MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search.

Authors: Chen K; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China.; Peking University Shenzhen Graduate School, Shenzhen 518055, China.; University of Science and Technology of China, Hefei 230026, China.; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China., Litfin T; Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia., Singh J; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China., Zhan J; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China., Zhou Y; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China.; Peking University Shenzhen Graduate School, Shenzhen 518055, China.; Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia.
Language: English
Source: Genomics, proteomics & bioinformatics [Genomics Proteomics Bioinformatics] 2024 May 09; Vol. 22 (1).
DOI: 10.1093/gpbjnl/qzae018
Abstract: The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases have not been consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome assembly and metagenome assembly from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets in the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI's nt database and 60-fold larger than RNAcentral. The new dataset, together with a new split-search strategy, allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS, coupled with the fully automatic homology search tool RNAcmap, will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
(© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.)
Database: MEDLINE