Arabic Text Recognition Using a Script-Independent Methodology: A Unified HMM-Based Approach for Machine-Printed and Handwritten Text
Authors: | Rohit Prasad, Shiv Vitaladevuni, Premkumar Natarajan, David Belanger, Huaigu Cao, Krishna Subramanian, Matin Kamali, Ehry MacRostie, Shirin Saleem |
---|---|
Year of publication: | 2012 |
Subject: | Computer science; Speech recognition; Feature extraction; Glyph; Optical character recognition; Pattern recognition; Scripting language; Handwriting recognition; Document and text processing; Artificial intelligence; Hidden Markov model; Cursive; Arabic script; Natural language processing |
Source: | Guide to OCR for Arabic Scripts ISBN: 9781447140719 |
DOI: | 10.1007/978-1-4471-4072-6_20 |
Description: | We describe BBN’s script-independent methodology for multilingual machine-print OCR and offline handwriting recognition (HWR) based on the use of hidden Markov models (HMM). The feature extraction, training, and recognition components of the system are all designed to be script-independent. The HMM training and recognition components are based on BBN’s Byblos hidden Markov modeling software. The HMM parameters are estimated automatically from the training data, without the need for laborious manually created rules. The system does not require any pre-segmentation of the data, either at the word level or at the character level. Thus, the system can handle languages with cursive handwritten scripts in a straightforward manner. The script independence of the system is demonstrated with experimental results in three scripts that exhibit significant differences in glyph characteristics: Arabic, Chinese, and English. Experimental results demonstrating the viability of the proposed methodology are presented. Offline HWR of free-flowing Arabic text is a challenging task due to the plethora of factors that contribute to the variability in the data. In light of this book’s focus on Arabic scripts, we address some of these sources of variability, and present experimental results on a large corpus of handwritten documents. Experimental results are provided for specific techniques such as the application of context-dependent HMMs for the cursive Arabic script and unsupervised adaptation to account for the stylistic variations across scribes/writers. We also present an innovative integration of structural features in the HMM framework which results in a 10% relative improvement in performance. We conclude with a new technique for dealing with noise related to the dots that are an integral yet disconnected part of many Arabic characters. |
Database: | OpenAIRE |