Printed Persian OCR system using deep learning

Authors: Marziye Rahmati, Mansoor Fateh, Mohsen Rezvani, Alireza Tajary, Vahid Abolghasemi
Language: English
Publication year: 2020
Subject:
Source: IET Image Processing, Vol 14, Iss 15, Pp 3920-3931 (2020)
Document type: article
ISSN: 1751-9667
1751-9659
DOI: 10.1049/iet-ipr.2019.0728
Description: Optical character recognition (OCR) is widely used, driven by demand from many different technologies. Most existing OCR systems focus on Latin-script languages. Recent studies have also introduced OCR systems for non-Latin texts written in cursive style, although such texts pose additional challenges. In this study, the authors propose an OCR system for the Persian language based on long short-term memory (LSTM) neural networks, and investigate how varying the parameters involved in this approach affects the results. The proposed OCR system resolves the false recognition of the sub-words ‘LA’ and ‘LA’. Moreover, the authors present a preprocessing algorithm that removes ‘justification’ using image processing. A new comprehensive data set is introduced, comprising five million images in eight popular Persian fonts and ten font sizes. The evaluations show that the proposed OCR system improves accuracy by 2% over the existing Persian OCR system. The experimental results indicate an average accuracy of 99.69% at the letter level. At the word level, the proposed system achieves an accuracy of 98.1% for ‘zero-width non-breaking space’ and 98.64% for ‘LA’.
Database: Directory of Open Access Journals