Character-Based Handwritten Text Recognition of Multilingual Documents

Authors: Miguel A. del Agua, Alfons Juan, Jorge Civera, Nicolás Serrano
Language: English
Publication year: 2012
Subject:
Source: RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Advances in Speech and Language Technologies for Iberian Languages ISBN: 9783642352911
DOI: 10.1007/978-3-642-35292-8_20
Description: [EN] An effective way to transcribe handwritten text documents is to follow a sequential interactive approach. During the supervision phase, user corrections are incorporated into the system through an ongoing retraining process. In the case of multilingual documents with a high percentage of out-of-vocabulary (OOV) words, two principal issues arise. On the one hand, a minor yet important matter for this interactive approach is to identify the language of the current text line image to be transcribed, as a language-dependent recogniser typically performs better than a monolingual recogniser. On the other hand, word-based language models suffer from data scarcity in the presence of a large number of OOV words, which degrades their estimation and affects the performance of the transcription system. In this paper, we successfully tackle both issues by deploying character-based language models combined with language identification techniques on an entire 764-page multilingual document. The results obtained significantly reduce the transcription error previously reported on the same task, but show that a language-dependent approach is not effective on top of character-based recognition of similar languages.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Also supported by the Spanish Government (MIPRCV "Consolider Ingenio 2010", iTrans2 TIN2009-14511, MITTRAL TIN2009-14633-C03-01 and FPU AP2007-0286) and the Generalitat Valenciana (Prometeo/2009/014).
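Illustrative note (not part of the original record): the two ideas named in the description, character-based language models and language identification, can be sketched in a few lines of Python. This is a minimal toy under stated assumptions and not the authors' system: the class `CharNGramLM`, the function `identify_language`, the add-k smoothing scheme, and the tiny Spanish/English training lines are all hypothetical choices made for illustration. It shows why a character-level model still assigns a finite score to a word it never saw as a whole, and how per-language models can be compared to guess the language of a text line.

```python
import math
from collections import Counter


class CharNGramLM:
    """Toy character n-gram language model with add-k smoothing (illustrative only)."""

    def __init__(self, n=3, k=1.0):
        self.n = n                  # n-gram order (3 = trigrams)
        self.k = k                  # smoothing constant
        self.ngrams = Counter()     # counts of (context + next character)
        self.contexts = Counter()   # counts of contexts
        self.vocab = set()          # characters seen in training

    def _pad(self, line):
        # '^' marks the start of a line, '$' marks its end.
        return "^" * (self.n - 1) + line + "$"

    def train(self, lines):
        for line in lines:
            padded = self._pad(line)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1

    def log_prob(self, line):
        padded = self._pad(line)
        # One extra class stands in for any character unseen in training.
        v = len(self.vocab) + 1
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[context + padded[i]] + self.k
            den = self.contexts[context] + self.k * v
            total += math.log(num / den)
        return total


def identify_language(line, models):
    """Return the key of the model that gives the line the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(line))


# Hypothetical toy training data, far too small for real use.
training = {
    "es": ["la casa blanca", "el perro come en la casa", "una mesa grande"],
    "en": ["the white house", "the dog eats at home", "a large table"],
}

models = {}
for lang, lines in training.items():
    lm = CharNGramLM(n=3)
    lm.train(lines)
    models[lang] = lm

# "mesas" never occurs as a whole word in the Spanish training lines, so a
# word-based model would treat it as OOV; the character model still scores it.
spanish_words = {w for line in training["es"] for w in line.split()}
oov_word = "mesas"
oov_score = models["es"].log_prob(oov_word)

guess_es = identify_language("la mesa blanca", models)
guess_en = identify_language("the large dog", models)
```

In this sketch the language of a line is taken to be the one whose character model scores it highest. A realistic system would use far more data, higher-order models with better smoothing, and scores coming from the handwriting recogniser rather than from clean text.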
Database: OpenAIRE