Semi-structured document categorization with a semantic kernel
Authors: | Younès Bennani, Sujeevan Aseervatham |
---|---|
Contributors: | Aseervatham, Sujeevan, Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS) |
Year of publication: | 2009 |
Subject: |
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing; [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]
Support vector machine; Text categorization; Computer science; Semantic analysis (machine learning); Semantic similarity; Naive Bayes classifier; Structured document; Semi-structured data; Unified Medical Language System; Similitude; Kernel method; Kernel (statistics); Mercer kernel; Tree kernel; Categorization; Ranking; Signal processing; Computer vision and pattern recognition; Artificial intelligence; Natural language processing; Software; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 03 medical and health sciences; 0303 health sciences; 030304 developmental biology |
Source: | HAL; Pattern Recognition, Elsevier, 2008, to appear |
ISSN: | 0031-3203 |
Description: | Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can degrade when the texts are complex, i.e., ambiguous. One alternative is to use semantic-based approaches, which process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat texts", whereas many documents are now semi-structured, especially in XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the Unified Medical Language System (UMLS) framework. The kernel, with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel ranked in the top 10 among 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge. |
Database: | OpenAIRE |
External link: |