Character-Based Handwritten Text Recognition of Multilingual Documents
Author: | Miguel A. del Agua, Alfons Juan, Jorge Civera, Nicolás Serrano |
---|---|
Language: | English |
Year of publication: | 2012 |
Subject: | Character; Language identification; Computer science; Handwritten Text Recognition; Speech recognition; Word error rate; 02 engineering and technology; Document processing; computer.software_genre; Intelligent word recognition; Machine Learning; Transcription (linguistics); Multilingual; 0202 electrical engineering, electronic engineering, information engineering; CIENCIAS DE LA COMPUTACION E INTELIGENCIA ARTIFICIAL; business.industry; 020207 software engineering; Handwriting recognition; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; HTR; Language model; Artificial intelligence; Transcription error; business; computer; LENGUAJES Y SISTEMAS INFORMATICOS; Natural language processing |
Source: | RiuNet. Repositorio Institucional de la Universitat Politècnica de València; Advances in Speech and Language Technologies for Iberian Languages, ISBN: 9783642352911 |
DOI: | 10.1007/978-3-642-35292-8_20 |
Description: | [EN] An effective way to transcribe handwritten text documents is to follow a sequential interactive approach. During the supervision phase, user corrections are incorporated into the system through an ongoing retraining process. In the case of multilingual documents with a high percentage of out-of-vocabulary (OOV) words, two principal issues arise. On the one hand, a minor yet important matter for this interactive approach is to identify the language of the current text line image to be transcribed, as a language-dependent recogniser typically performs better than a monolingual recogniser. On the other hand, word-based language models suffer from data scarcity in the presence of a large number of OOV words, which degrades their estimation and affects the performance of the transcription system. In this paper, we successfully tackle both issues by deploying character-based language models combined with language identification techniques on an entire 764-page multilingual document. The results obtained significantly improve on previously reported results in terms of transcription error on the same task, but show that a language-dependent approach is not effective on top of character-based recognition of similar languages. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. It was also supported by the Spanish Government (MIPRCV “Consolider Ingenio 2010”, iTrans2 TIN2009-14511, MITTRAL TIN2009-14633-C03-01 and FPU AP2007-0286) and the Generalitat Valenciana (Prometeo/2009/014). |
Database: | OpenAIRE |