Robust named entity detection from optical character recognition output

Authors: Prem Natarajan, Krishna Subramanian, Rohit Prasad
Publication year: 2011
Subject:
Source: International Journal on Document Analysis and Recognition (IJDAR). 14:189-200
ISSN: 1433-2825
1433-2833
DOI: 10.1007/s10032-011-0150-z
Description: In this paper, we focus on information extraction from optical character recognition (OCR) output. Because OCR output inherently contains many errors, we present robust algorithms that extract information from OCR lattices rather than from the top-choice (1-best) OCR output alone. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, this approach has two problems: (1) the number of false alarms can be prohibitive for certain applications, and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate these problems, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with OCR systems for videotext in English and scanned handwritten text in Arabic.
Database: OpenAIRE
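
Illustrative note: the contrast the description draws between 1-best lookup and lattice search, and the use of a confidence threshold to limit false alarms, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' method: the lattice is simplified to a confusion network (one list of alternatives per position), the confidence of a match is assumed to be the product of the matched tokens' posteriors, and all names (`one_best`, `find_in_one_best`, `find_in_lattice`, `sample_lattice`) and all probability values are hypothetical.

```python
from math import prod

# Toy confusion network: one list of (token, posterior) alternatives per position.
# The tokens and posteriors are made up for illustration only.
sample_lattice = [
    [("prime", 0.9), ("prune", 0.1)],
    [("minister", 0.8), ("monster", 0.2)],
    [("jolm", 0.55), ("john", 0.45)],   # the OCR error outranks the correct token
    [("smith", 0.6), ("smlth", 0.4)],
]


def one_best(lattice):
    """Return the top-scoring token at each position (the 1-best hypothesis)."""
    return [max(slot, key=lambda alt: alt[1])[0] for slot in lattice]


def find_in_one_best(lattice, entity):
    """Return start positions where the entity matches the 1-best tokens exactly."""
    tokens = one_best(lattice)
    n = len(entity)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == entity]


def find_in_lattice(lattice, entity, min_confidence=0.0):
    """Return (start, confidence) pairs where the entity matches any alternatives.

    Confidence is assumed to be the product of the matched posteriors; matches
    scoring below min_confidence are discarded to limit false alarms.
    """
    hits = []
    n = len(entity)
    for start in range(len(lattice) - n + 1):
        scores = []
        for offset, token in enumerate(entity):
            alternatives = dict(lattice[start + offset])
            if token not in alternatives:
                break  # this position cannot produce the required token
            scores.append(alternatives[token])
        else:
            confidence = prod(scores)
            if confidence >= min_confidence:
                hits.append((start, confidence))
    return hits


entity = ["john", "smith"]
missed = find_in_one_best(sample_lattice, entity)             # 1-best contains "jolm"
recovered = find_in_lattice(sample_lattice, entity)           # match found at position 2
strict = find_in_lattice(sample_lattice, entity, min_confidence=0.5)
```

In this toy example the 1-best lookup misses the entity because the top hypothesis contains a recognition error, while the lattice search recovers it with a low confidence; raising `min_confidence` then trades that recall back for fewer false alarms. The inner loop also shows where the extra computation comes from, since every position's alternatives must be examined rather than a single token.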