Word Segmentation for Chinese Novels

Autor: Likun Qiu, Yue Zhang
Rok vydání: 2015
Předmět:
Zdroj: Proceedings of the AAAI Conference on Artificial Intelligence. 29
ISSN: 2374-3468
2159-5399
Popis: Word segmentation is a necessary first step for automaticsyntactic analysis of Chinese text. Chinese segmentationis highly accurate on news data, but the accuraciesdrop significantly on other domains, such as science andliterature. For scientific domains, a significant portionof out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentationsignificantly. For the literature domain, however,there is not a fixed set of domain terms. For example,each novel can contain a specifiac set of person, organizationand location names. We investigate a method forautomatically mining common noun entities for eachnovel using information extraction techniques, and usethe resulting entities to improve a state-of-the-art segmentationmodel for the novel. In particular, we designa novel double-propagation algorithm that mines nounentities together with common contextual patterns, anduse them as plug-in features to a model trained on thesource domain. An advantage of our method is that noretraining for the segmentation model is needed for eachnovel, and hence it can be applied efficiently given thehuge number of novels on the web. Results on five differentnovels show significantly improved accuracies,in particular for OOV words.
Databáze: OpenAIRE