Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation.

Authors: Dobrokhotov PB (Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet, CH-1211 Geneva 4, Switzerland. Pavel.Dobrokhotov@isb-sib.ch); Goutte C; Veuthey AL; Gaussier E
Language: English
Source: Bioinformatics (Oxford, England) [Bioinformatics] 2003; Vol. 19 Suppl 1, pp. i91-4.
DOI: 10.1093/bioinformatics/btg1011
Abstract: Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.
Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the Kullback-Leibler symmetric divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It also has great value for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.
Database: MEDLINE
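
The abstract's use of a symmetric Kullback-Leibler divergence to find discriminating words can be illustrated with a minimal sketch. It is not the authors' PLC implementation: it assumes plain smoothed unigram term distributions for a relevant and an irrelevant class, and the toy documents, the function names `term_distribution` and `symmetric_kl_contributions`, and the add-one smoothing are all hypothetical choices made for illustration. The symmetric divergence KL(p||q) + KL(q||p) equals the sum over terms of (p - q) * log(p / q), so each term's summand can be read as its contribution.

```python
import math
from collections import Counter


def term_distribution(docs, vocabulary, smoothing=1.0):
    """Smoothed unigram distribution over `vocabulary` for a list of tokenised docs.

    Add-one style smoothing is an assumption of this sketch; it keeps every
    probability strictly positive so the logarithms below are defined.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[t] for t in vocabulary) + smoothing * len(vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def symmetric_kl_contributions(p, q):
    """Per-term summands of KL(p||q) + KL(q||p), i.e. (p - q) * log(p / q)."""
    return {t: (p[t] - q[t]) * math.log(p[t] / q[t]) for t in p}


# Toy, made-up documents standing in for relevant and irrelevant abstracts.
relevant = [["mutation", "disease", "patient"], ["mutation", "variant", "disease"]]
irrelevant = [["structure", "crystal", "binding"], ["binding", "expression", "structure"]]

vocabulary = sorted({tok for doc in relevant + irrelevant for tok in doc})
p = term_distribution(relevant, vocabulary)
q = term_distribution(irrelevant, vocabulary)

contrib = symmetric_kl_contributions(p, q)
divergence = sum(contrib.values())

# Terms sorted from most to least discriminating between the two classes.
ranked = sorted(contrib, key=contrib.get, reverse=True)
```

Every summand is non-negative, so terms can be ranked directly by their contribution; terms that are frequent in one class and rare in the other come first, whichever class they favour.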