Towards a Better Semantic Matching for Indexation Improvement of Error-Prone (Semi-)Structured XML Documents
Author: | Sylvie Calabretto, Béatrice Rumpler, Arnaud Renard |
---|---|
Contributors: | Distribution, Recherche d'Information et Mobilité (DRIM), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Université Lumière - Lyon 2 (UL2), Joaquim Filipe, José Cordeiro |
Publication year: | 2011 |
Subject: |
Thesaurus (information retrieval); Information retrieval; Computer science; 05 social sciences; 02 engineering and technology; Ambiguity; Ontology (information science); Semantics; Ranking (information retrieval); Identification (information); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; [INFO]Computer Science [cs]; 0509 other social sciences; 050904 information & library sciences; XML; Semantic matching |
Source: | Lecture Notes in Business Information Processing (LNBIP), ISBN: 9783642228094; WEBIST (Selected Papers); Joaquim Filipe, José Cordeiro; Springer-Verlag, pp. 286-298, 2011, ⟨10.1007/978-3-642-22810-0_21⟩ |
DOI: | 10.1007/978-3-642-22810-0_21 |
Description: | Documents containing errors in their textual content (which we call noisy documents) are difficult for Information Retrieval (IR) systems to handle. The same holds for the (semi-)structured IR systems this paper deals with. The problem is even greater when those systems rely on semantics. To exploit semantics, they need an additional external semantic resource related to the document collection. Ranking is then made possible by comparing concepts using similarity measures. Similarity measures assume that the concepts related to the words have been identified without ambiguity. However, this assumption cannot be made in the presence of noisy documents, where words are potentially misspelled, resulting in a word with a different meaning or at least in a non-word. Semantic-aware (semi-)structured IR systems rely on basic concept identification but do not account for spelling uncertainties. As this can degrade system results, we suggest a way to detect and correct misspelled terms that can be used in the document pre-processing stage of IR systems. First results on small datasets seem promising. |
Database: | OpenAIRE |
External link: |