BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

Author:	William A. Baumgartner, Tom Christiansen, Haibin Liu, Karin Verspoor
Language:	English
Subject:	Computer Networks and Communications; Computer science; Health Informatics; 02 engineering and technology; Scientific literature; Lexicon; computer.software_genre; lcsh:Computer applications to medicine. Medical informatics; Set (abstract data type); 03 medical and health sciences; 0202 electrical engineering, electronic engineering, information engineering; 030304 developmental biology; 0303 health sciences; Lemma (mathematics); business.industry; Lemmatisation; Research; Variety (linguistics); Data science; Biomedical text mining; Computer Science Applications; Information extraction; lcsh:R858-859.7; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; Information Systems
Source:	Journal of Biomedical Semantics, Vol 3, Iss 1, p 3 (2012)
ISSN:	2041-1480
DOI:	10.1186/2041-1480-3-3
Description:	Background: The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of natural language processing of the scientific literature on molecular biology. Lemmatization has been actively applied to the morphological analysis of these texts in recent biomedical research. Results: In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through the incorporation of several published lexical resources. It retrieves lemmas from a word lexicon and defines a set of rules that transform a word into a lemma when the word is not found in the lexicon. An innovative aspect of the BioLemmatizer is its hierarchical strategy for searching the lexicon, which enables discovery of the correct lemma even when the input part-of-speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. Its contribution to improving the accuracy of a practical information extraction task is further demonstrated by using it as a component in a biomedical text mining system. Conclusions: The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::7e734e4f7166c284d4900bc6177a3ca4
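
The lookup strategy summarized in the description (lexicon first, rules as a fallback, and a hierarchical search that tolerates an inaccurate part-of-speech tag) can be illustrated with a minimal Python sketch. This is not the BioLemmatizer implementation or its API. The lexicon entries, the coarse-category map, the suffix rules, and the function name `lemmatize` are all hypothetical, and the three search levels shown here are only an assumed way of arranging such a hierarchy.

```python
# Toy lexicon keyed by (word form, Penn Treebank-style POS tag).
# Illustrative entries only; these are not BioLemmatizer data.
LEXICON = {
    ("phosphorylated", "VBN"): "phosphorylate",
    ("phosphorylated", "VBD"): "phosphorylate",
    ("nuclei", "NNS"): "nucleus",
    ("mice", "NNS"): "mouse",
}

# Hypothetical mapping from fine-grained tags to coarse categories.
COARSE = {
    "NN": "noun", "NNS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb", "VBZ": "verb",
}

# Toy suffix rules, tried in order when the lexicon has no entry.
RULES = [("ies", "y"), ("sses", "ss"), ("s", "")]


def lemmatize(word, pos):
    """Return (lemma, source), where source names the step that produced it."""
    w = word.lower()

    # Level 1: exact match on word form and POS tag.
    if (w, pos) in LEXICON:
        return LEXICON[(w, pos)], "lexicon-exact"

    # Level 2: same word form under any tag of the same coarse category.
    coarse = COARSE.get(pos)
    if coarse is not None:
        same_category = sorted(
            {lemma for (lw, lp), lemma in LEXICON.items()
             if lw == w and COARSE.get(lp) == coarse}
        )
        if same_category:
            return same_category[0], "lexicon-coarse"

    # Level 3: same word form under any tag at all (the POS tag is ignored).
    any_tag = sorted({lemma for (lw, _), lemma in LEXICON.items() if lw == w})
    if any_tag:
        return any_tag[0], "lexicon-any"

    # Fallback: apply the first matching suffix rule, keeping a stem of 3+ letters.
    for suffix, replacement in RULES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: len(w) - len(suffix)] + replacement, "rule"

    # Nothing applied: return the lowercased word unchanged.
    return w, "identity"


if __name__ == "__main__":
    for word, pos in [("nuclei", "NNS"), ("nuclei", "VBZ"), ("antibodies", "NNS")]:
        print(word, pos, lemmatize(word, pos))
```

In this sketch, a word tagged with the wrong part of speech (for example `nuclei` tagged `VBZ`) still resolves through the broader search levels, while a word absent from the lexicon (for example `antibodies`) falls through to the suffix rules.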