Comparing a Rule Based vs. Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty

Autor: Susanne M. Humphrey, Aurélie Névéol, Allen Browne, Julien Gobeil, Patrick Ruch, Stéfan J. Darmoni
Přispěvatelé: Darmoni, Stéfan, Chercheur indépendant, Laboratoire Leibniz (Leibniz - IMAG), Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Grenoble (INPG)-Université Joseph Fourier - Grenoble 1 (UJF), Service d'informatique médicale (SIM), Hôpitaux de Genève, Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)
Jazyk: angličtina
Rok vydání: 2009
Předmět:
Zdroj: Journal of the American Society for Information Science and Technology
Journal of the American Society for Information Science and Technology, Association for Information Science and Technology (ASIS&T), 2009, pp.2530-2539
ISSN: 1532-2882
1532-2890
Popis: Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings(®) (MeSH(®)) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI) based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for one hundred MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures, performance is comparable, and for one measure, JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule based) might be combined and then evaluated showing they are complementary to one another.
Databáze: OpenAIRE