Description: |
Neural sequence-to-sequence (Seq2Seq) methods have proven highly effective for context-sensitive Thai spelling correction. However, they also inherit the drawbacks of Seq2Seq, such as a fixed vocabulary and large training-data requirements. Dictionary-based methods, meanwhile, are not robust enough in their typical applications to produce corrections with low error rates. These drawbacks inhibit the application of both approaches in a broader range of use cases. In this paper, we provide a practical guide to building correction systems progressively and efficiently, with three main contributions. First, we present a process for efficiently and progressively producing training data for both neural-based and dictionary-based methods. Our annotation process enables existing methods to be trained with only two percent of the data annotated by hand. Second, we propose the Extendable Neural Contextual Corrector (XNCC), a novel text correction approach that decouples the dictionary from the neural model, allowing the dictionary to be extended after training. Finally, we compare text correction systems under various configurations to demonstrate how they can be used effectively to produce corrections. Our experiments show that 1) minor changes to dictionary-based methods can significantly improve correction performance, 2) neural-based correction systems can be trained on a fraction of the data, and 3) XNCC's dictionary can be extended to generalize to new data without retraining. Lastly, based on our findings, we provide recommendations for progressively building text correction systems at multiple levels of implementation effort.