CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image
Authors: | Subhadip Basu, Mahantapas Kundu, Dipak Kumar Basu, Ram Sarkar, Mita Nasipuri, Nibaran Das |
---|---|
Year of publication: | 2011 |
Subject: |
Ground truth; Database; Computer science; Optical character recognition; Computer Science Applications; Bengali; Scripting language; Pattern recognition (psychology); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Code (cryptography); Segmentation; Computer Vision and Pattern Recognition; Artificial intelligence; Line (text file); Software; Natural language processing |
Source: | International Journal on Document Analysis and Recognition (IJDAR). 15:71-83 |
ISSN: | 1433-2825 1433-2833 |
DOI: | 10.1007/s10032-011-0148-6 |
Description: | In this paper, we describe the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. This is the first handwritten database in this area available as an open source document. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 are written in Bangla text mixed with English words. The off-line handwritten pages were collected from different data sources. After collection, all the documents were preprocessed and distributed into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we also provide ground truth images for line segmentation. To generate the ground truth images, we first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors using our tool GT Gen 1.1. Our algorithm achieves line extraction accuracies of 90.6% and 92.38% on the two databases, respectively. Both databases, along with the ground truth annotations and the ground truth generating tool, are freely available at http://code.google.com/p/cmaterdb. |
Database: | OpenAIRE |
External link: |