HMMs for Unsupervised Vietnamese Word Segmentation

Authors:	Huu-Hoang Nguyen, Thi-Trang Nguyen, Kiem-Hieu Nguyen, Ba-Long Bui
Publication year:	2019
Subject:	Computer science; Speech recognition; Vietnamese; Supervised learning; Text segmentation; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Pointwise mutual information; Viterbi algorithm; Viterbi decoder; Hidden Markov model; Word (computer architecture)
Source:	RIVF
Description:	Word segmentation is an important problem in natural language processing. Most previous work on Vietnamese word segmentation relies on supervised learning. In this paper, we propose an unsupervised method for Vietnamese word segmentation based on Hidden Markov Models. We incorporate prior linguistic knowledge into model learning in a natural way. For decoding, we propose an enhancement of the Viterbi algorithm that uses external token-ordering statistics derived from Pointwise Mutual Information. Evaluation on benchmark datasets shows that the proposed method works reasonably well. Source code is available at https://github.com/longbb/word_recognition
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::e1083bb3508a0839eece08eb669af74e https://doi.org/10.1109/rivf.2019.8713693