Character-Based Handwritten Text Recognition of Multilingual Documents

Authors:	Miguel A. del Agua, Alfons Juan, Jorge Civera, Nicolás Serrano
Language:	English
Year of publication:	2012
Subject:	Character; Language identification; Computer science; Handwritten Text Recognition; Speech recognition; Word error rate; 02 engineering and technology; Document processing; computer.software_genre; Intelligent word recognition; Machine Learning; Transcription (linguistics); Multilingual; 0202 electrical engineering, electronic engineering, information engineering; CIENCIAS DE LA COMPUTACION E INTELIGENCIA ARTIFICIAL; business.industry; 020207 software engineering; Handwriting recognition; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; HTR; Language model; Artificial intelligence; Transcription error; business; computer; LENGUAJES Y SISTEMAS INFORMATICOS; Natural language processing
Source:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València; Advances in Speech and Language Technologies for Iberian Languages, ISBN: 9783642352911
DOI:	10.1007/978-3-642-35292-8_20
Description:	[EN] An effective way to transcribe handwritten text documents is to follow a sequential interactive approach. During the supervision phase, user corrections are incorporated into the system through an ongoing retraining process. In the case of multilingual documents with a high percentage of out-of-vocabulary (OOV) words, two principal issues arise. On the one hand, a minor yet important matter for this interactive approach is to identify the language of the current text line image to be transcribed, as a language-dependent recogniser typically performs better than a monolingual recogniser. On the other hand, word-based language models suffer from data scarcity in the presence of a large number of OOV words, which degrades their estimation and affects the performance of the transcription system. In this paper, we successfully tackle both issues by deploying character-based language models combined with language identification techniques on an entire 764-page multilingual document. The results obtained significantly reduce the transcription error previously reported on the same task, but show that a language-dependent approach is not effective on top of character-based recognition of similar languages. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Also supported by the Spanish Government (MIPRCV "Consolider Ingenio 2010", iTrans2 TIN2009-14511, MITTRAL TIN2009-14633-C03-01 and FPU AP2007-0286) and the Generalitat Valenciana (Prometeo/2009/014).
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9e03f93241ce4f29b49a0b04f7990f20