Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation
Authors: | Sun Maosong, Xu Dongliang, Benjamin K. T'Sou, Lu Huaming |
---|---|
Year of publication: | 2008 |
Subject: |
Thesaurus (information retrieval); Computer science; Speech recognition; Word formation; Character (mathematics); Artificial intelligence; Chinese word; Chinese characters; Semantic information; Hidden Markov model; Word (computer architecture); Natural language processing |
Source: | Text, Speech and Dialogue (TSD), ISBN: 9783540873907 |
DOI: | 10.1007/978-3-540-87391-4_20 |
Description: | This paper presents a novel approach to Chinese disyllabic word extraction based on the semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words, together with the character thesauri, is used to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction. |
Database: | OpenAIRE |
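
The abstract describes learning semantic constraints between the two characters of a word from a wordlist and a character thesaurus. The record does not give the model's equations, so the following is only a minimal sketch of one way a semantic-tag-based HMM score for a two-character candidate could be computed. The toy thesaurus, the tag names, the wordlist, the even splitting of counts over ambiguous tags, and the functions `train` and `score` are all assumptions made for illustration, not the authors' data or method.

```python
from collections import Counter
from itertools import product

# Hypothetical toy character thesaurus: character -> semantic tags.
THESAURUS = {
    "江": {"WATER"},
    "河": {"WATER"},
    "湖": {"WATER"},
    "山": {"LAND"},
    "水": {"WATER", "LIQUID"},
}

# Hypothetical toy wordlist of two-character words.
WORDLIST = ["江河", "河水", "湖水", "山水", "江湖", "江山"]


def train(wordlist, thesaurus):
    """Collect fractional counts for tag starts, tag transitions and emissions.

    Assumption: when a character has several tags, the word's unit count is
    split evenly over all tag pairs.
    """
    start, trans, emit, tag_total = Counter(), Counter(), Counter(), Counter()
    for word in wordlist:
        c1, c2 = word
        tags1, tags2 = thesaurus[c1], thesaurus[c2]
        weight = 1.0 / (len(tags1) * len(tags2))
        for t1, t2 in product(tags1, tags2):
            start[t1] += weight
            trans[(t1, t2)] += weight
            emit[(t1, c1)] += weight
            emit[(t2, c2)] += weight
            tag_total[t1] += weight
            tag_total[t2] += weight
    return start, trans, emit, tag_total


def score(candidate, model, thesaurus):
    """Sum P(t1) * P(t2|t1) * P(c1|t1) * P(c2|t2) over all tag pairs."""
    start, trans, emit, tag_total = model
    n_words = sum(start.values())
    c1, c2 = candidate
    total = 0.0
    for t1 in thesaurus.get(c1, ()):
        for t2 in thesaurus.get(c2, ()):
            # Skip tag pairs with no training evidence to avoid dividing by zero.
            if start[t1] == 0 or tag_total[t2] == 0:
                continue
            total += (
                (start[t1] / n_words)
                * (trans[(t1, t2)] / start[t1])
                * (emit[(t1, c1)] / tag_total[t1])
                * (emit[(t2, c2)] / tag_total[t2])
            )
    return total


if __name__ == "__main__":
    model = train(WORDLIST, THESAURUS)
    for candidate in ["湖河", "山山"]:
        print(candidate, score(candidate, model, THESAURUS))
```

In this toy setup, the candidate 湖河 does not occur in the wordlist but still receives a non-zero score, because its tag pair (WATER, WATER) was observed in other words. A purely character-based bigram model trained on the same wordlist would have no evidence for it. The candidate 山山 scores zero because the tag pair (LAND, LAND) never occurs in the training words.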