Improving Cross-Domain
Author: | Yuxiao Ye, Yue Zhang, Weikang Li, Jian Sun, Likun Qiu |
---|---|
Year of publication: | 2019 |
Subject: | business.industry; Computer science; 02 engineering and technology; computer.software_genre; Domain (software engineering); 030507 speech-language pathology & audiology; 03 medical and health sciences; Margin (machine learning); Noun; 0202 electrical engineering, electronic engineering, information engineering; Key (cryptography); Code (cryptography); 020201 artificial intelligence & image processing; Segmentation; Artificial intelligence; Chinese word; 0305 other medical science; business; computer; Natural language processing; Word (computer architecture) |
Source: | NAACL-HLT (1) |
Description: | Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model uses only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in specialized domains, covering novels, medicine, and patents. Results show that our model clearly improves cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our data and code available on GitHub. |
Database: | OpenAIRE |
External link: |
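
The description reports results as word F-measure. As a rough illustration only, the sketch below shows one common way this metric is computed for word segmentation: each segmented sentence is converted to character spans, and a predicted word counts as correct only if its span exactly matches a gold span. This is an assumption about the conventional span-based evaluation, not the authors' scoring script; the function names `word_spans` and `word_f_measure` and the example sentence are hypothetical.

```python
def word_spans(words):
    """Convert a segmented sentence (list of words) into a set of
    (start, end) character offsets, one span per word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f_measure(gold_sentences, pred_sentences):
    """Return (precision, recall, F1) over words, assuming exact span match.

    Both arguments are lists of sentences; each sentence is a list of words
    whose concatenation gives the same raw character sequence.
    """
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans = word_spans(gold)
        pred_spans = word_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with identical boundaries
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)

    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits one two-character word.
gold = [["我", "喜欢", "自然", "语言"]]
pred = [["我", "喜", "欢", "自然", "语言"]]
precision, recall, f1 = word_f_measure(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case three of the five predicted words match the four gold words, so a single boundary error lowers both precision and recall.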