Learning from mistakes: Improving spelling correction performance with automatic generation of realistic misspellings.

Authors: Büyük, Osman; Arslan, Levent M.
Source: Expert Systems; Aug 2021, Vol. 38, Issue 5, pp. 1-16 (16 pages)
Abstract: Sequence-to-sequence (seq2seq) models require a large amount of labelled training data to learn the mapping between input and output. Training a seq2seq spelling correction system therefore requires a large set of misspelled words together with their corrections. Low-resource languages such as Turkish usually lack such large annotated datasets. Although misspelling-reference pairs can be synthesized with a random procedure, the generated dataset may not match genuine human-made misspellings well, which might degrade performance in realistic test scenarios. In this paper, we propose a novel procedure to automatically introduce human-like misspellings into legitimate Turkish words. The generated human-like misspellings are used to improve the performance of a seq2seq spelling correction system. The proposed system consists of two separate models: a misspelling generator and a spelling corrector. The generator is trained using a relatively small number of human-made misspellings and their manual corrections. Reference words and their misspellings are used as the inputs and outputs of the generator, respectively. As a result, it is trained to add realistic spelling errors to valid words. The training data of the spelling corrector is augmented with the generator's human-like misspellings. In the experiments, we observe that the data augmentation significantly improves spelling correction performance. Our proposed method yields a 5% absolute improvement over state-of-the-art Turkish spelling correction systems on a test set that contains human-made misspellings from Twitter messages. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
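
Illustration: The abstract contrasts the proposed learned generator with the simpler baseline of synthesizing misspelling-reference pairs "with a random procedure". The sketch below shows what such a random baseline might look like; it is not the authors' method or code. The function names (`random_misspelling`, `synthesize_pairs`), the choice of four edit operations, and the lowercase Turkish alphabet constant are all assumptions made for this illustration.

```python
import random

# Assumption: lowercase Turkish alphabet used for random insertions and substitutions.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def random_misspelling(word, rng):
    """Apply one random character-level edit to `word` (hypothetical helper)."""
    if not word:
        return word
    # Deletion and transposition need at least two characters to be meaningful.
    ops = ["insert", "substitute"]
    if len(word) >= 2:
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    i = rng.randrange(len(word))
    if op == "insert":
        return word[:i] + rng.choice(TURKISH_ALPHABET) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        candidates = [c for c in TURKISH_ALPHABET if c != word[i]]
        return word[:i] + rng.choice(candidates) + word[i + 1:]
    # Transpose two adjacent characters.
    j = rng.randrange(len(word) - 1)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def synthesize_pairs(words, per_word, seed=0):
    """Build (misspelling, reference) pairs, skipping edits that change nothing."""
    rng = random.Random(seed)
    pairs = []
    for word in words:
        for _ in range(per_word):
            noisy = random_misspelling(word, rng)
            if noisy != word:  # e.g. transposing two identical letters
                pairs.append((noisy, word))
    return pairs


if __name__ == "__main__":
    for misspelled, reference in synthesize_pairs(["kitap", "öğrenci"], per_word=3):
        print(misspelled, "->", reference)
```

In the approach the abstract describes, pairs of this kind would instead come from a generator trained on human-made misspellings, and the corrector's training set would be the human-made pairs augmented with the generated ones. Uniformly random edits like those above are exactly the kind of data the abstract says may not match genuine human errors.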