A tailored Handwritten-Text-Recognition System for Medieval Latin

Authors: Koch, Philipp; Nuñez, Gilary Vera; Arias, Esteban Garces; Heumann, Christian; Schöffel, Matthias; Häberlin, Alexander; Aßenmacher, Matthias
Publication year: 2023
Subject:
Document type: Working Paper
Description: The Bavarian Academy of Sciences and Humanities aims to digitize its Medieval Latin Dictionary. This dictionary comprises record cards referring to lemmas in Medieval Latin, a low-resource language. A crucial step of the digitization process is the Handwritten Text Recognition (HTR) of the handwritten lemmas found on these record cards. In our work, we introduce an end-to-end pipeline, tailored to the Medieval Latin Dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art (SOTA) image segmentation models to prepare the initial data set for the HTR task. Furthermore, we conduct a set of experiments to explore the capabilities of different combinations of transformer-based vision encoders with a GPT-2 decoder. We also apply extensive data augmentation, resulting in a highly competitive model. The best-performing setup achieves a Character Error Rate (CER) of 0.015, which is lower than that of the commercial Google Cloud Vision model, and shows more stable performance.
Comment: This paper has been accepted at the First Workshop on Ancient Language Processing, co-located with RANLP 2023. This is the author's version of the work. The definitive version of record will be published in the proceedings.
Database: arXiv
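
Note on the metric: the description reports a Character Error Rate (CER) of 0.015 but this record does not say how it was computed. The sketch below assumes the conventional definition, namely the character-level Levenshtein distance between prediction and reference, summed over all samples and divided by the total number of reference characters. The function names `levenshtein` and `cer` and the sample lemmas are illustrative only and are not taken from the paper.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.
    Assumed convention; the paper may normalize differently."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical transcriptions: one substitution in 5 + 6 reference characters.
    print(cer(["lemma", "abbas"[:5] + "s"], ["lemna", "abbass"]))
```

Under this convention, a CER of 0.015 corresponds to roughly 1.5 character edits per 100 reference characters.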