Semi-structured document categorization with a semantic kernel

Authors: Younès Bennani, Sujeevan Aseervatham
Contributors: Aseervatham, Sujeevan, Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS)
Publication year: 2009
Subject:
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]
Support Vector Machine
Text Categorization
Computer science
Semantic analysis (machine learning)
[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing
02 engineering and technology
computer.software_genre
Semantic Similarity
03 medical and health sciences
Naive Bayes classifier
Structured document
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]
Artificial Intelligence
0202 electrical engineering, electronic engineering, information engineering
030304 developmental biology
0303 health sciences
business.industry
Unified Medical Language System
Semi-Structured Data
Similitude
Kernel method
Categorization
Ranking
Kernel (statistics)
Signal Processing
Mercer Kernel
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
020201 artificial intelligence & image processing
Computer Vision and Pattern Recognition
Tree kernel
business
computer
Software
Natural language processing
Source: HAL
Pattern Recognition
Pattern Recognition, Elsevier, 2008, To appear
ISSN: 0031-3203
Description: Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can degrade when the texts are complex, i.e., ambiguous. One alternative is to use semantic-based approaches, which process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat texts", whereas many documents are now semi-structured, particularly in XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the Unified Medical Language System (UMLS) framework. The kernel, with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel ranked in the top 10 among 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge.
Database: OpenAIRE
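
Illustrative note: the description above mentions a semantic kernel over semi-structured documents but does not give its definition. The following is only a minimal sketch of the general idea, not the authors' kernel. It assumes a hypothetical toy concept table (`TOY_CONCEPTS`) in place of a real UMLS lookup, represents each document as a dictionary of named text fields, and sums a concept-space dot product over the fields two documents share. Because each field kernel is an inner product of concept-count vectors and a sum of such kernels is again a valid kernel, the result is positive semi-definite. All names and sample documents here are assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical toy concept table standing in for a UMLS lookup (assumption).
TOY_CONCEPTS = {
    "cough": {"C_COUGH"},
    "coughing": {"C_COUGH"},
    "fever": {"C_FEVER"},
    "pyrexia": {"C_FEVER"},
    "pneumonia": {"C_PNEUMONIA"},
}


def concept_vector(text):
    """Map a text to concept counts; unknown words act as their own concept."""
    counts = Counter()
    for word in text.lower().split():
        for concept in TOY_CONCEPTS.get(word, {word}):
            counts[concept] += 1
    return counts


def field_kernel(text_a, text_b):
    """Dot product of two texts in concept space."""
    va, vb = concept_vector(text_a), concept_vector(text_b)
    return sum(va[c] * vb[c] for c in va.keys() & vb.keys())


def document_kernel(doc_a, doc_b):
    """Sum the field kernels over the fields present in both documents."""
    shared_fields = doc_a.keys() & doc_b.keys()
    return sum(field_kernel(doc_a[f], doc_b[f]) for f in shared_fields)


def normalized_kernel(doc_a, doc_b):
    """Cosine-style normalization so that a non-empty document scores 1 with itself."""
    denom = math.sqrt(document_kernel(doc_a, doc_a) * document_kernel(doc_b, doc_b))
    return document_kernel(doc_a, doc_b) / denom if denom else 0.0


# Made-up sample documents with two fields each (assumption).
doc1 = {"history": "cough and fever", "impression": "pneumonia"}
doc2 = {"history": "coughing with pyrexia", "impression": "pneumonia"}
doc3 = {"history": "ankle pain", "impression": "fracture"}

print(normalized_kernel(doc1, doc2))  # synonyms match through shared concepts
print(normalized_kernel(doc1, doc3))  # no shared concepts
```

In this toy setup, `doc1` and `doc2` share no surface words in their `history` field, yet they still score as similar because "cough"/"coughing" and "fever"/"pyrexia" map to the same concepts. A Gram matrix built from such a function could, for instance, be passed to an SVM implementation that accepts precomputed kernels.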