Diacritics Restoration using BERT with Analysis on Czech language

Authors: Náplava, Jakub; Straka, Milan; Straková, Jana
Year of publication: 2021
Subject:
Source: The Prague Bulletin of Mathematical Linguistics No. 116, 2021, pp. 27-42
Document type: Working Paper
DOI: 10.14712/00326585.013
Description: We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are not actually errors, but either plausible variants (19%) or system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
Database: arXiv