HMMs for Unsupervised Vietnamese Word Segmentation

Authors: Huu-Hoang Nguyen, Thi-Trang Nguyen, Kiem-Hieu Nguyen, Ba-Long Bui
Year of publication: 2019
Subject:
Source: RIVF
Description: Word segmentation is an important problem in natural language processing. Most previous work on Vietnamese word segmentation relies on supervised learning. In this paper, we propose an unsupervised method for Vietnamese word segmentation based on Hidden Markov Models. We encode prior linguistic knowledge directly into model learning. For decoding, we propose an enhancement of the Viterbi algorithm that uses external token-ordering statistics derived from Pointwise Mutual Information. Evaluation on benchmark datasets shows that the proposed method works reasonably well. Source code is available at https://github.com/longbb/word_recognition
Database: OpenAIRE
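
Note: The description mentions Pointwise Mutual Information (PMI) as the source of token-ordering statistics. As a rough illustration of that quantity only, the sketch below computes PMI(x, y) = log2(p(x, y) / (p(x) p(y))) for adjacent syllable pairs in a toy corpus. It is not the authors' implementation: the function name `bigram_pmi`, the toy sentences, and the maximum-likelihood estimates without smoothing are all assumptions made for this example. How the paper combines these scores with Viterbi decoding is not stated in this record; see the linked repository for the actual method.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return PMI (base 2) for every adjacent syllable pair in `sentences`.

    `sentences` is a list of syllable lists. Probabilities are plain
    relative frequencies (no smoothing), an assumption of this sketch.
    """
    unigrams = Counter()
    bigrams = Counter()
    for syllables in sentences:
        unigrams.update(syllables)
        # Ordered pairs of neighbouring syllables: (s[0], s[1]), (s[1], s[2]), ...
        bigrams.update(zip(syllables, syllables[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus, already split into syllables.
corpus = [
    "học sinh đi học".split(),
    "sinh viên đi học".split(),
    "học sinh học bài".split(),
]
scores = bigram_pmi(corpus)

# Print pairs from the most to the least strongly associated.
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Because the pairs are ordered, the score for ("học", "sinh") differs from the score for ("sinh", "học"), which is the kind of ordering information a decoder could consult when deciding whether two neighbouring syllables belong to the same word.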