Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

Authors: Sun Maosong, Xu Dongliang, Benjamin K. T'Sou, Lu Huaming
Year of publication: 2008
Source: Text, Speech and Dialogue (TSD), ISBN: 9783540873907
DOI: 10.1007/978-3-540-87391-4_20
Description: This paper presents a novel approach to Chinese disyllabic word extraction based on the semantic information of characters. Two thesauri of Chinese characters are constructed, one manually crafted and one machine-generated. A Chinese wordlist of 63,738 two-character words is used, together with the character thesauri, to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMMs. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
Database: OpenAIRE