Identifying biological concepts from a protein-related corpus with a probabilistic topic model
Author: | Xinghua Lu, David C. McLean, Bin Zheng |
---|---|
Year of publication: | 2005 |
Subject: | Topic model; Text corpus; Vocabulary; Computer science; Abstracting and Indexing; MEDLINE; lcsh:Computer applications to medicine. Medical informatics; Biochemistry; Latent Dirichlet allocation; Models, Biological; Pattern Recognition, Automated; Body of knowledge; 03 medical and health sciences; Structural Biology; Artificial Intelligence; Terminology as Topic; Controlled vocabulary; Selection (linguistics); Molecular Biology; lcsh:QH301-705.5; 030304 developmental biology; Natural Language Processing; 0303 health sciences; Information retrieval; Models, Statistical; Applied Mathematics; 05 social sciences; Proteins; Mutual information; Computer Science Applications; Index (publishing); lcsh:Biology (General); Vocabulary, Controlled; lcsh:R858-859.7; 0509 other social sciences; Periodicals as Topic; 050904 information & library sciences; Algorithms; Research Article |
Source: | BMC Bioinformatics, Vol 7, Iss 1, p 58 (2006) |
ISSN: | 1471-2105 |
Description: | Background: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding the functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE© titles and abstracts by applying a probabilistic topic model. Results: The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on Bayesian model selection, 300 major topics were extracted from the corpus. The majority of the identified topics/concepts were found to be semantically coherent, and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of Gene Ontology (GO) terms based on mutual information. Conclusion: The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide a parsimonious and semantically enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text. |
Database: | OpenAIRE |
External link: |
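
The description states that topics were mapped to GO terms based on mutual information, but this record does not give the computation. As a minimal sketch only, assuming each document contributes one pair of discrete indicators (whether a topic is present, and whether the associated protein carries a given GO term), the mutual information of the two indicators could be estimated from co-occurrence counts as below. The function name `mutual_information` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate mutual information (in bits) between two discrete
    variables from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)              # counts of each (x, y) combination
    px = Counter(x for x, _ in pairs)   # marginal counts of x
    py = Counter(y for _, y in pairs)   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data, one pair per document:
# (topic present in the document?, protein annotated with the GO term?)
docs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
score = mutual_information(docs)
```

Under this assumption, a topic and a GO term that always co-occur score 1 bit for balanced binary indicators, while statistically independent ones score 0, so ranking GO terms by this score for each topic gives one plausible way to pick a mapping.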