How to Progressively Build Thai Spelling Correction Systems?

Authors: Anuruth Lertpiya, Tawunrat Chalothorn, Pakpoom Buabthong
Language: English
Publication year: 2023
Subject:
Source: IEEE Access, Vol 11, Pp 72704-72716 (2023)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3295004
Description: Neural-based sequence-to-sequence (Seq2Seq) methods have proven to be highly effective for context-sensitive Thai spelling correction. However, they also inherit the drawbacks of Seq2Seq, such as a fixed vocabulary and large data requirements. Dictionary-based methods and their typical applications, on the other hand, are insufficiently robust to produce corrections with reduced error rates. These drawbacks inhibit the application of these methods in a broader range of use cases. In this paper, we provide a practical guide on how to build correction systems progressively and efficiently, with three main contributions. First, we present a process for efficiently and progressively producing training data for both neural-based and dictionary-based methods. Our annotation process enables existing methods to be trained with only two percent of the data hand-annotated. Second, we propose the Extendable Neural Contextual Corrector (XNCC), a novel text correction approach that decouples the dictionary from the neural model. This enables the dictionary to be extended post-training. Finally, we compare text correction systems with various configurations to demonstrate how these systems can be effectively used to produce corrections. Our experiments show that 1) minor changes to dictionary-based methods can significantly improve correction performance, 2) neural-based correction systems can be trained using a fraction of the data, and 3) XNCC can have its dictionary extended to generalize to new data without re-training. Lastly, based on our findings, we provide recommendations for progressively building text correction systems at multiple levels of implementation effort.
Database: Directory of Open Access Journals
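
Note: The abstract describes decoupling the dictionary from the scoring model so that the dictionary can be extended after training. The sketch below illustrates only that general idea and is not the authors' XNCC implementation. The class `ExtendableCorrector`, its methods, and the string-similarity scorer are hypothetical stand-ins (a real system would plug in a trained contextual model as the scorer), and ASCII placeholder words are used instead of Thai text.

```python
import difflib


class ExtendableCorrector:
    """Toy corrector: candidates come from a mutable dictionary,
    while the scorer is a separate, swappable component (assumption:
    a neural contextual model would take its place in practice)."""

    def __init__(self, dictionary, scorer=None):
        self.dictionary = set(dictionary)
        # Default scorer: plain character-level similarity, ignoring context.
        self.scorer = scorer or (
            lambda token, cand, context: difflib.SequenceMatcher(
                None, token, cand
            ).ratio()
        )

    def extend(self, words):
        """Add new words without touching (or re-training) the scorer."""
        self.dictionary.update(words)

    def correct(self, token, context=()):
        """Return the best-scoring dictionary candidate, or the token itself."""
        if token in self.dictionary:
            return token
        candidates = difflib.get_close_matches(
            token, sorted(self.dictionary), n=5, cutoff=0.6
        )
        if not candidates:
            return token  # nothing close enough; leave the token unchanged
        return max(candidates, key=lambda c: self.scorer(token, c, context))


corrector = ExtendableCorrector(["spelling", "correction"])
before = corrector.correct("tokenizr")   # no close candidate in the dictionary yet
corrector.extend(["tokenizer"])          # extend the dictionary post hoc
after = corrector.correct("tokenizr")    # now a candidate is available
```

Because the scorer never stores the vocabulary itself, adding entries through `extend` changes which candidates are proposed without requiring any change to the scoring component.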