Author: |
Thurimella K; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.; Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK.; School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA., Mohamed AMT; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA., Graham DB; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA., Owens RM; Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK., La Rosa SL; Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway., Plichta DR; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA., Bacallado S; Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK., Xavier RJ; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA. |
Abstract: |
In metagenomics, the large pool of uncharacterized microbial enzymes presents a challenge for functional annotation. Among these, carbohydrate-active enzymes (CAZymes) stand out because of their pivotal roles in biological processes related to host health and nutrition. Here, we present CAZyLingua, the first tool that uses protein language model embeddings to build a deep learning framework for annotating CAZymes in metagenomic datasets. In benchmarking on the annotated genomes of Bacteroides thetaiotaomicron, Eggerthella lenta and Ruminococcus gnavus, CAZyLingua achieved on average a higher F1 score (the harmonic mean of precision and recall) than the traditional sequence homology-based method in dbCAN2. We applied our tool to a paired mother/infant longitudinal dataset and revealed previously unannotated CAZymes linked to microbial development during infancy. When applied to metagenomic datasets derived from patients affected by fibrosis-prone diseases such as Crohn's disease and IgG4-related disease, CAZyLingua uncovered CAZymes associated with disease and healthy states. In each of these metagenomic catalogs, CAZyLingua discovered new annotations that had been overlooked by traditional sequence homology tools. Overall, the deep learning model CAZyLingua can be applied in combination with existing tools to unravel intricate CAZyme evolutionary profiles and patterns, contributing to a more comprehensive understanding of microbial metabolic dynamics. |
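The F1 score used in the benchmark combines precision and recall into a single number. The sketch below illustrates the calculation for a set-based comparison of predicted and reference CAZyme annotations. It is a minimal illustration, not the authors' evaluation code: the function name `f1_report`, the gene identifiers, and the choice to compare plain sets of gene IDs are all assumptions made for this example.

```python
def f1_report(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Precision, recall and F1 for predicted vs. reference annotations.

    Both arguments are sets of gene identifiers labelled as CAZymes
    (hypothetical inputs; the real benchmark may score per family).
    """
    tp = len(predicted & reference)   # predicted and truly annotated
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gene IDs, for illustration only.
predicted_ids = {"gene_1", "gene_2", "gene_3", "gene_4"}
reference_ids = {"gene_2", "gene_3", "gene_4", "gene_5", "gene_6"}
report = f1_report(predicted_ids, reference_ids)
print(report)
```

With these made-up inputs, three of four predictions are correct (precision 0.75) and three of five reference genes are recovered (recall 0.6), so F1 is 2/3. Because the harmonic mean is pulled toward the lower of the two values, a method cannot score well by maximizing only precision or only recall.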