Author: |
Bazzo, Guilherme Torresan; Lorentz, Gustavo Acauan; Suarez Vargas, Danny; Moreira, Viviane P. |
Language: |
English |
Year of publication: |
2020 |
Subject: |
|
Source: |
Advances in Information Retrieval |
Description: |
A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion often introduces errors, especially when OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. To quantify this impact, we systematically inserted errors at varying rates into an initially clean IR collection. Our results show that the impact becomes significant starting at a 5% error rate. Furthermore, stemming proved to make systems more robust to errors. |
Database: |
OpenAIRE |
External link: |
|