A Concept Vector Space Model for Semantic Kernels

Author: Sujeevan Aseervatham
Contributors: Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS), Aseervatham, Sujeevan
Language: English
Publication year: 2008
Subject:
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]
Support Vector Machine
Computer science
[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing
02 engineering and technology
Semantic Similarity
Artificial Intelligence
Polynomial kernel
String kernel
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Probabilistic latent semantic analysis
Pattern recognition
[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]
Kernel
Kernel method
Kernel embedding of distributions
Kernel (statistics)
Radial basis function kernel
020201 artificial intelligence & image processing
Text categorization
Tree kernel
Natural language processing
Source: International Journal on Artificial Intelligence Tools
International Journal on Artificial Intelligence Tools, World Scientific Publishing, 2008, 18 (2), To appear
HAL
ISSN: 0218-2130
Description: Kernels are widely used in Natural Language Processing as similarity measures within inner-product-based learning methods such as the Support Vector Machine. The Vector Space Model (VSM) is extensively used to represent documents spatially, but it is a purely statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation that uses linguistic prior knowledge to capture the meanings of documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of the linguistic concepts, whereas the latent kernel combines statistical and linguistic concepts: it uses latent concepts extracted by Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain using the Ohsumed corpus, which is well known for being difficult to categorize. The results show that the CVSM improves performance compared to the VSM.
Database: OpenAIRE