HMMs for Unsupervised Vietnamese Word Segmentation
Author: | Huu-Hoang Nguyen, Thi-Trang Nguyen, Kiem-Hieu Nguyen, Ba-Long Bui |
---|---|
Year of publication: | 2019 |
Subject: |
Computer science
Speech recognition; Vietnamese; Supervised learning; Text segmentation; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Pointwise mutual information; Viterbi algorithm; Viterbi decoder; Hidden Markov model; Word (computer architecture) |
Source: | RIVF |
Description: | Word segmentation is an important problem in natural language processing. Most previous work on Vietnamese word segmentation relies on supervised learning. In this paper, we propose an unsupervised method for Vietnamese word segmentation based on Hidden Markov Models. We encode prior linguistic knowledge directly into model learning. For decoding, we propose an enhancement of the Viterbi decoding algorithm that uses external token ordering statistics from Pointwise Mutual Information. Evaluation on benchmark datasets shows that the proposed method works reasonably well. Source code is available at https://github.com/longbb/word_recognition |
Database: | OpenAIRE |
External link: |
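
The description mentions token ordering statistics from Pointwise Mutual Information (PMI). As a rough illustration of that quantity only, the sketch below computes PMI(a, b) = log2(P(a, b) / (P(a) P(b))) for adjacent syllable pairs in a toy corpus. The function name `pmi_table`, the toy sentences, and the simple relative-frequency estimates are assumptions made for this example; they are not taken from the paper or its repository, and the sketch does not show how the authors combine PMI with Viterbi decoding.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Compute PMI for every adjacent syllable pair.

    `sentences` is a list of syllable lists (hypothetical input format).
    Probabilities are plain relative frequencies with no smoothing.
    """
    unigrams = Counter()
    bigrams = Counter()
    for syllables in sentences:
        unigrams.update(syllables)
        # Adjacent pairs within one sentence only.
        bigrams.update(zip(syllables, syllables[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


if __name__ == "__main__":
    # Toy corpus, already split into syllables (illustrative only).
    corpus = [
        ["học", "sinh", "đi", "học"],
        ["học", "sinh", "giỏi"],
    ]
    for pair, score in sorted(pmi_table(corpus).items()):
        print(pair, round(score, 3))
```

On a corpus this small, pairs that occur only once can score higher than genuinely frequent ones, so a practical system would likely need frequency cutoffs or smoothing before feeding such scores into a decoder.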