Robust named entity detection from optical character recognition output
Authors: | Prem Natarajan, Krishna Subramanian, Rohit Prasad |
---|---|
Year of publication: | 2011 |
Subject: | Computer science; Speech recognition; Computation; Pattern recognition; Optical character recognition; Intelligent word recognition; Computer Science Applications; Named entity; Information extraction; Confidence measures; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Computer Vision and Pattern Recognition; Artificial intelligence; Named entity detection; Hidden Markov model; Software |
Source: | International Journal on Document Analysis and Recognition (IJDAR). 14:189-200 |
ISSN: | 1433-2825 1433-2833 |
DOI: | 10.1007/s10032-011-0150-z |
Description: | In this paper, we focus on information extraction from optical character recognition (OCR) output. Because OCR output inherently contains many errors, we present robust algorithms that extract information from OCR lattices rather than searching only the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, this approach has two problems: (1) the number of false alarms can be prohibitive for certain applications, and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To address these problems, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with OCR systems for videotext in English and scanned handwritten text in Arabic. |
Database: | OpenAIRE |