A Study of Chinese Word Segmentation under LangGeh orthography

Autor: Wang, Jian Jie, 王建傑
Rok vydání: 2013
Druh dokumentu: 學位論文 ; thesis
Popis: 101
The concept of words in Mandarin Chinese is not really well defined. And as a result the important basic word segmentation module of the natural language processing of Chinese becomes somewhat difficult to implement. The primary standard of word segmentation in Taiwan is the CKIP standard of Academia Sinica, which uses semantics, syntax, and usage frequency to define a word. We propose an added principle of singleton-avoiding that dictates minimizing single character word in a segmented text. More specifically, two character string and three character string are principally treated as a word. By making use of the number of characters in defining a word, the standard becomes easy to follow. Furthermore, by writing the Chinese sentences with spaces between simple short phrases (called LangGeh orthography) instead of traditional way of no spaces in-between, and the segmentation module becomes much easier to implement. An implemented segmentation module written in programming language Python is tested on a testing text corpus of around 30000 characters, collected from internet and transformed into LangGeh orthography. The resulting performance is 98% in F-measure, and compared quite favorably to the traditional word segmentation of about 96% using large amount of training data. For marginalized languages such as Taiwanese and Hakka, LangGeh and the new segmentation standard seem to be the way to follow. Keywords: Chinese word segmentation, singleton-avoiding principle, LangGeh orthography, segmentation standard.
Databáze: Networked Digital Library of Theses & Dissertations