Implementation of an Optical Character Reader (OCR) for Bengali language

Authors:	Baijed Hossain Bipul, Md. Khalilur Rhaman, Muhammed Tawfiq Chowdhury, Md. Saiful Islam
Year of publication:	2015
Subject:	business.industry; Character (computing); Computer science; computer.file_format; Optical character recognition; computer.software_genre; language.human_language; Intelligent word recognition; Bengali; Font; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; language; Tesseract; Artificial intelligence; Image file formats; business; computer; Natural language processing; Sentence
Source:	2015 International Conference on Data and Software Engineering (ICoDSE).
Description:	Optical Character Recognition (OCR) is the process of extracting text from an image. The main purpose of an OCR system is to produce editable documents from existing paper documents or image files. A significant number of algorithms are required to develop an OCR system, which basically works in two phases: character detection and word detection. In a more sophisticated approach, an OCR system also performs sentence detection to preserve a document's structure. Researchers have put considerable effort into developing a Bengali OCR, but none of the resulting systems is completely error free. To address this issue, version 3.03 of the Tesseract OCR engine for the Windows operating system, the latest at the time of the work, is used to develop an OCR for the Bengali language. A total of 18110 characters and 2617 words are used to build the OCR's library. In this research, the ‘Solaimanlipi’ font and 200 input files are used to test the accuracy of the OCR. For clean image files, the accuracy of the software is as high as 97.56%. Accuracy is measured as the percentage of correctly recognized characters and words.
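The abstract reports accuracy as the percentage of correct characters and words but does not say how OCR output is aligned with the ground truth. The following is a minimal sketch of one possible measurement, assuming an in-order alignment computed with Python's standard `difflib.SequenceMatcher`. The function names `accuracy`, `char_accuracy`, and `word_accuracy` are hypothetical and are not taken from the paper; characters are counted as Unicode code points, which is also an assumption for Bengali text.

```python
from difflib import SequenceMatcher


def accuracy(reference, recognized):
    """Percentage of reference items that also appear, in order, in the OCR output.

    Assumption: items are aligned with difflib's matching blocks; the paper
    does not state which alignment method the authors used.
    """
    if not reference:
        # Nothing was expected: perfect only if nothing was recognized.
        return 100.0 if not recognized else 0.0
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(reference)


def char_accuracy(ref_text, ocr_text):
    # Characters are compared as Unicode code points (an assumption).
    return accuracy(list(ref_text), list(ocr_text))


def word_accuracy(ref_text, ocr_text):
    # Words are whitespace-separated tokens.
    return accuracy(ref_text.split(), ocr_text.split())
```

Under these assumptions, `word_accuracy` applied to a four-word ground-truth sentence and an OCR result with one misrecognized word gives 75.0, since three of the four reference words are matched.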
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::01a5b9413dab7d001f5d1c703352d9ff https://doi.org/10.1109/icodse.2015.7436984