A scarce dataset for ancient Arabic handwritten text recognition

Authors: Rayyan Najam, Safiullah Faizullah
Language: English
Publication year: 2024
Subject:
Source: Data in Brief, Vol 56, Iss , Pp 110813- (2024)
Document type: article
ISSN: 2352-3409
DOI: 10.1016/j.dib.2024.110813
Description: Deep learning-based optical character recognition (OCR) is an active area of research, in which deep neural networks are trained on data to extract text from images. Although many advances are being made in this area in general, the Arabic OCR domain notably lacks a dataset for ancient manuscripts. Here, we fill this gap by providing both the images and the textual ground truth for a collection of ancient Arabic manuscripts. This scarce dataset was collected from the central library of the Islamic University of Madinah, and it encompasses rich text spanning different geographies and centuries. Specifically, the dataset contains eight ancient books with a total of forty pages, provided as both images and text transcribed by experts. The dataset holds significant value because such data is not publicly available; it supports researchers and practitioners in developing, augmenting, validating, testing, and generalizing deep learning models for both Arabic OCR and Arabic text correction.
Database: Directory of Open Access Journals