Automatic extraction of transcriptional regulatory interactions of bacteria from biomedical literature using a BERT-based approach.

Authors: Varela-Vega A; Programa de Genómica Computacional, Centro de Ciencias Genómicas, UNAM, Av. Universidad S/N Col. Chamilpa, Cuernavaca, Morelos 62210, México., Posada-Reyes AB; Laboratorio de Microbiología, Inmunología y Salud Pública, Facultad de Estudios Superiores Cuautitlán, UNAM, Carretera Cuautitlán-Teoloyucan Km. 2.5, Xhala, Cuautitlán Izcalli, Estado de México 54714, México., Méndez-Cruz CF; Programa de Genómica Computacional, Centro de Ciencias Genómicas, UNAM, Av. Universidad S/N Col. Chamilpa, Cuernavaca, Morelos 62210, México.
Language: English
Source: Database : the journal of biological databases and curation [Database (Oxford)] 2024 Aug 30; Vol. 2024.
DOI: 10.1093/database/baae094
Abstract: Transcriptional regulatory networks (TRNs) give a global view of the regulatory mechanisms that bacteria use to respond to environmental signals. These networks are published in biological databases as a valuable resource for experimental and bioinformatics researchers. Despite efforts to publish TRNs of diverse bacteria, many bacteria still lack one, and many of the existing TRNs are incomplete. In addition, manual extraction of information from biomedical literature ("literature curation") has been the traditional way to build these networks, even though it is demanding and time-consuming. Recently, language models based on pretrained transformers have been used to extract relevant knowledge from biomedical literature. Moreover, the benefit of fine-tuning a large pretrained model on a limited amount of new data for a specific task ("transfer learning") opens avenues for addressing new problems in biomedical information extraction. Here, to alleviate this lack of knowledge and assist literature curation, we present a new approach based on the Bidirectional Encoder Representations from Transformers (BERT) architecture to classify transcriptional regulatory interactions of bacteria as a first step toward extracting TRNs from literature. The approach achieved strong performance on a test dataset of Escherichia coli sentences (F1-score: 0.8685, Matthews correlation coefficient: 0.8163). Examination of the model predictions revealed that the model learned different ways of expressing a regulatory interaction. The approach was then evaluated by extracting a TRN of Salmonella from 264 complete articles. The evaluation showed that the approach accurately extracted 82% of the network and that it also extracted interactions absent from the curation data. To the best of our knowledge, the present study is the first effort to obtain a BERT-based approach to extract this specific kind of interaction. This approach is a starting point to address the limitations of reconstructing TRNs of bacteria and diseases of biological interest. Database URL: https://github.com/laigen-unam/BERT-trn-extraction.
(© The Author(s) 2024. Published by Oxford University Press.)
Database: MEDLINE
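
Note on the reported metrics: the abstract gives an F1-score and a Matthews correlation coefficient (MCC) for the classifier. As a rough illustration of how these two measures relate to a confusion matrix, the sketch below computes both for the binary case. The function names and the example counts are hypothetical and are not taken from the paper, which may use a multi-class or averaged variant of these metrics.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts, for illustration only (not the paper's data).
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_mcc = mcc(tp=8, tn=8, fp=2, fn=2)
```

Unlike F1, MCC also takes true negatives into account, which is one reason the two are often reported together for classifiers evaluated on imbalanced data.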