Robust named entity detection from optical character recognition output

Author:	Prem Natarajan, Krishna Subramanian, Rohit Prasad
Year of publication:	2011
Subject:	Computer science; business.industry; Speech recognition; Computation; Pattern recognition; Optical character recognition; computer.software_genre; Intelligent word recognition; Computer Science Applications; Named entity; Information extraction; Confidence measures; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Computer Vision and Pattern Recognition; Artificial intelligence; Named entity detection; Hidden Markov model; business; computer; Software
Source:	International Journal on Document Analysis and Recognition (IJDAR). 14:189-200
ISSN:	1433-2825 1433-2833
DOI:	10.1007/s10032-011-0150-z
Description:	In this paper, we focus on information extraction from optical character recognition (OCR) output. Because OCR output inherently contains many errors, we present robust algorithms that extract information from OCR lattices instead of merely looking up entities in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, this approach has two problems: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate these problems, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::b8328806d4ed1679644c6a934a284626 https://doi.org/10.1007/s10032-011-0150-z
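
Illustration:	The contrast drawn in the description, between 1-best lookup and lattice search with confidence-based filtering of false alarms, can be sketched on a toy example. This is a minimal sketch and not the authors' algorithm: it assumes a word-level confusion network instead of a full OCR lattice, uses the product of slot posteriors as a stand-in confidence measure, and all names (`LATTICE`, `find_in_one_best`, `find_in_lattice`, `min_conf`) and probability values are hypothetical.

```python
from math import prod

# Toy word-level confusion network (an assumed simplification of an OCR
# lattice): each slot maps a candidate word to a made-up posterior.
LATTICE = [
    {"president": 0.9, "resident": 0.1},
    {"barack": 0.45, "barrack": 0.55},
    {"obama": 0.6, "abama": 0.4},
    {"spoke": 1.0},
]


def one_best(lattice):
    """Return the top-choice word from every slot."""
    return [max(slot, key=slot.get) for slot in lattice]


def find_in_one_best(lattice, entity):
    """Return start positions where the entity matches the 1-best string."""
    words = one_best(lattice)
    n = len(entity)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == entity]


def find_in_lattice(lattice, entity, min_conf=0.0):
    """Return (position, confidence) pairs for entity matches in the lattice.

    Confidence is the product of the posteriors of the matched words;
    matches scoring below min_conf are dropped as likely false alarms.
    """
    n = len(entity)
    hits = []
    for i in range(len(lattice) - n + 1):
        probs = [lattice[i + k].get(word, 0.0) for k, word in enumerate(entity)]
        conf = prod(probs)
        if conf > 0.0 and conf >= min_conf:
            hits.append((i, conf))
    return hits


if __name__ == "__main__":
    entity = ["barack", "obama"]
    print("1-best hits:", find_in_one_best(LATTICE, entity))
    print("lattice hits:", find_in_lattice(LATTICE, entity))
    print("low-confidence hit filtered:",
          find_in_lattice(LATTICE, ["resident"], min_conf=0.2))
```

In this toy data the 1-best string contains "barrack", so the 1-best lookup misses the entity, while the lattice search recovers it at position 1 through the lower-ranked alternative "barack". The same mechanism also surfaces the unlikely alternative "resident", which the `min_conf` threshold then removes. The sketch does not address the computational cost of lattice search that the description also mentions.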