Identifying biological concepts from a protein-related corpus with a probabilistic topic model

Authors: Xinghua Lu, David C. McLean, Bin Zheng
Year of publication: 2005
Subject:
Topic model
Text corpus
Vocabulary
Computer science
Abstracting and Indexing
MEDLINE
lcsh:Computer applications to medicine. Medical informatics
Biochemistry
Latent Dirichlet allocation
Models, Biological
Pattern Recognition, Automated
Body of knowledge
03 medical and health sciences
Structural Biology
Artificial Intelligence
Terminology as Topic
Controlled vocabulary
Selection (linguistics)
Molecular Biology
lcsh:QH301-705.5
030304 developmental biology
Natural Language Processing
0303 health sciences
Information retrieval
Models, Statistical
Applied Mathematics
05 social sciences
Proteins
Mutual information
Computer Science Applications
Index (publishing)
lcsh:Biology (General)
Vocabulary, Controlled
lcsh:R858-859.7
Artificial intelligence
0509 other social sciences
Periodicals as Topic
050904 information & library sciences
Algorithms
Research Article
Source: BMC Bioinformatics
BMC Bioinformatics, Vol 7, Iss 1, p 58 (2006)
ISSN: 1471-2105
Description: Background: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE© titles and abstracts by applying a probabilistic topic model. Results: The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on Bayesian model selection, 300 major topics were extracted from the corpus. The majority of the identified topics/concepts were found to be semantically coherent, and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of Gene Ontology (GO) terms based on mutual information. Conclusion: The major recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide a parsimonious and semantically enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text.
Database: OpenAIRE