Transcription Alignment of Historical Vietnamese Manuscripts without Human-Annotated Learning Samples

Autor: Anna Scius-Bertrand, Michael Jungo, Beat Wolf, Andreas Fischer, Marc Bui
Jazyk: angličtina
Rok vydání: 2021
Předmět:
Zdroj: Applied Sciences, Vol 11, Iss 11, p 4894 (2021)
Druh dokumentu: article
ISSN: 11114894
2076-3417
DOI: 10.3390/app11114894
Popis: The current state of the art for automatic transcription of historical manuscripts is typically limited by the requirement of human-annotated learning samples, which are are necessary to train specific machine learning models for specific languages and scripts. Transcription alignment is a simpler task that aims to find a correspondence between text in the scanned image and its existing Unicode counterpart, a correspondence which can then be used as training data. The alignment task can be approached with heuristic methods dedicated to certain types of manuscripts, or with weakly trained systems reducing the required amount of annotations. In this article, we propose a novel learning-based alignment method based on fully convolutional object detection that does not require any human annotation at all. Instead, the object detection system is initially trained on synthetic printed pages using a font and then adapted to the real manuscripts by means of self-training. On a dataset of historical Vietnamese handwriting, we demonstrate the feasibility of annotation-free alignment as well as the positive impact of self-training on the character detection accuracy, reaching a detection accuracy of 96.4% with a YOLOv5m model without using any human annotation.
Databáze: Directory of Open Access Journals