UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation

Author: Yonghua Wen, Yantuan Xian, Yuehan Wang, Zhengtao Yu
Language: English
Publication year: 2024
Subject:
Source: Applied Sciences, Vol 14, Iss 23, p 11435 (2024)
Document type: article
ISSN: 2076-3417
DOI: 10.3390/app142311435
Description: Word segmentation is a critical task in natural language processing for Southeast Asian abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity due to specialized preprocessing requirements. This paper introduces UnifiedCut, a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut’s multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation.
Database: Directory of Open Access Journals
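
Illustration: the description above mentions multiple n-grams taken from a fixed-length windowed context. The following is a minimal Python sketch of that kind of input extraction, not the authors' implementation. The function name `windowed_ngrams`, the `PAD` symbol, the `radius` parameter, and the chosen n-gram orders are all hypothetical, and the sketch treats each Python code point as one character, which may differ from the character units used in the paper.

```python
from typing import Dict, List, Sequence

PAD = "\u2205"  # hypothetical padding symbol for positions outside the text


def windowed_ngrams(
    text: str,
    center: int,
    radius: int = 2,
    orders: Sequence[int] = (1, 2, 3),
) -> Dict[int, List[str]]:
    """Collect the n-grams of each order inside a fixed window around `center`.

    Assumption: the window spans `radius` characters on each side of the
    center character, so it always has length 2 * radius + 1.
    """
    if not 0 <= center < len(text):
        raise IndexError("center must be a valid index into text")
    # Pad both sides so every position sees a window of the same length.
    padded = PAD * radius + text + PAD * radius
    window = padded[center : center + 2 * radius + 1]
    features: Dict[int, List[str]] = {}
    for n in orders:
        # Slide over the window and keep every contiguous n-gram of order n.
        features[n] = [window[i : i + n] for i in range(len(window) - n + 1)]
    return features
```

For example, `windowed_ngrams("abcde", 2, radius=2, orders=(1, 2))` returns the unigrams `a, b, c, d, e` and the bigrams `ab, bc, cd, de`. In a model of the kind described, each n-gram would presumably be embedded and passed to an attention layer; that step is outside the scope of this sketch.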