A tailored Handwritten-Text-Recognition System for Medieval Latin
Author: | Koch, Philipp; Nuñez, Gilary Vera; Arias, Esteban Garces; Heumann, Christian; Schöffel, Matthias; Häberlin, Alexander; Aßenmacher, Matthias |
---|---|
Year of publication: | 2023 |
Subject: | |
Document type: | Working Paper |
Description: | The Bavarian Academy of Sciences and Humanities aims to digitize its Medieval Latin Dictionary. This dictionary comprises record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the Handwritten Text Recognition (HTR) of the handwritten lemmas found on these record cards. In our work, we introduce an end-to-end pipeline, tailored to the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art (SOTA) image segmentation models to prepare the initial data set for the HTR task. Furthermore, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. We also apply extensive data augmentation, resulting in a highly competitive model. The best-performing setup achieves a Character Error Rate (CER) of 0.015, outperforming the commercial Google Cloud Vision model while showing more stable performance. Comment: This paper has been accepted at the First Workshop on Ancient Language Processing, co-located with RANLP 2023. This is the author's version of the work. The definitive version of record will be published in the proceedings. |
Database: | arXiv |
External link: |
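
The description reports a Character Error Rate (CER) of 0.015 but does not spell out how it was computed. A minimal sketch of the usual definition, assuming CER is the character-level Levenshtein (edit) distance between the predicted and the reference transcription divided by the reference length, could look as follows. The function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from the paper, and the authors' exact evaluation protocol (for example, how scores are aggregated across lemmas) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a five-character lemma gives a CER of 1/5.
print(character_error_rate("lemma", "lemna"))
```

Under this definition, a CER of 0.015 corresponds to roughly 1.5 character errors per 100 reference characters.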