Word-like character n-gram embedding
Author: Kazuki Fukui, Geewook Kim, Hidetoshi Shimodaira
Year of publication: 2018
Subject: Vocabulary; Word embedding; Text segmentation; n-gram; Character (mathematics); Word lists by frequency; Embedding; Natural language processing; Word (computer architecture); Computer science; Artificial intelligence
Source: NUT@EMNLP
DOI: 10.18653/v1/w18-6120
Description: We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method extends the recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus; however, its n-gram vocabulary tends to contain too many non-word n-grams. We solve this problem by introducing the idea of expected word frequency. Compared with previously proposed methods, our method can embed more words, including words that are not contained in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the corpus is written in an unsegmented language and contains many neologisms and informal words (e.g., a Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method embeds more words and improves the performance of downstream tasks.
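As a rough illustration of the segmentation-free starting point described above, the Python sketch below builds a character n-gram vocabulary directly from raw, unsegmented text by simple frequency thresholding. The function name, parameters, and the plain count cutoff are illustrative assumptions, not the paper's expected-word-frequency criterion, which further filters out non-word n-grams.

```python
from collections import Counter

def frequent_char_ngrams(corpus, n_max=3, min_count=2):
    """Collect character n-grams (lengths 1..n_max) from raw text and
    keep the frequent ones as an n-gram vocabulary.

    Toy stand-in for segmentation-free vocabulary construction; the
    paper replaces the raw-count cutoff with expected word frequency.
    """
    counts = Counter()
    for line in corpus:
        text = line.strip()
        for n in range(1, n_max + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy usage on a tiny unsegmented "corpus".
corpus = ["深度学习很有趣", "我喜欢深度学习", "学习使我快乐"]
vocab = frequent_char_ngrams(corpus)
print(sorted(vocab.items(), key=lambda kv: -kv[1])[:10])
```

In this sketch, frequent multi-character n-grams such as "学习" surface alongside many fragments that are not words, which is exactly the vocabulary-noise problem the expected-word-frequency filtering in the paper is designed to address.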
Database: OpenAIRE
External link: