MIDV-LAIT: A Challenging Dataset for Recognition of IDs with Perso-Arabic, Thai, and Indian Scripts

Authors: Yulia S. Chernyshova, Vladimir V. Arlazarov, Alexander Sheshkus, Ekaterina Emelianova
Publication year: 2021
Subject:
Source: Document Analysis and Recognition – ICDAR 2021 ISBN: 9783030863302
ICDAR (2)
DOI: 10.1007/978-3-030-86331-9_17
Description: In this paper, we present MIDV-LAIT, a new dataset for the recognition of identity documents (IDs). The main feature of the dataset is its textual fields in Perso-Arabic, Thai, and Indian scripts. Since datasets with real IDs may not be published openly, we synthetically generated all the images and data. Even the faces are generated and do not belong to any real person. Several datasets have recently appeared for evaluating ID detection, type identification, and recognition, but they cover only Latin-based and Cyrillic-based languages. The proposed dataset is intended to fill this gap and make it easier to evaluate and compare various methods. As a baseline, we process all the textual field images in MIDV-LAIT with Tesseract OCR. The resulting recognition accuracy shows that the dataset is challenging and useful for further research.
Database: OpenAIRE