IHR-NomDB: The Old Degraded Vietnamese Handwritten Script Archive Database

Authors: Van Linh Le, Manh Tu Vu, Marie Beurton-Aimar
Year of publication: 2021
Subject:
Source: Document Analysis and Recognition – ICDAR 2021 ISBN: 9783030863333
ICDAR (3)
Description: This paper introduces IHR-NomDB, a new handwritten database for an old Vietnamese writing system called ChuNom. Over 260 pages of ChuNom were collected from the Vietnamese Nom Preservation Foundation, and their bounding boxes were analyzed and annotated manually to generate more than 5000 patches, each containing an image of handwritten text, the corresponding digital ChuNom characters, and their translation into modern Vietnamese script. Accompanying this handwriting dataset is a new Synthetic Nom String dataset, which consists of 101,621 images generated from our collected bank of ChuNom sentences. In total, 13,254 characters are represented across the two parts of the database, making it the first and largest publicly available database for research on this old Vietnamese writing script. For baseline results, we tested on the validation set of the handwriting dataset using a Convolutional Recurrent Neural Network (CRNN) pretrained on the Synthetic Nom String dataset with CTC loss, achieving 42.70% accuracy at the sentence level and 82.28% accuracy at the character level. The database is available for download at https://morphoboid.labri.fr/ihr-nom.html.
Database: OpenAIRE