Learning Multilingual Word Embeddings Using Image-Text Data

Authors:	Karan Singhal, Balder ten Cate, Karthik Raman
Publication year:	2019
Subjects:	FOS: Computer and information sciences; Computer Science - Machine Learning (cs.LG); Computer Science - Computation and Language (cs.CL); Computer Science - Artificial Intelligence (cs.AI); Computer Science - Computer Vision and Pattern Recognition (cs.CV); Computer science; Artificial intelligence; Natural language processing; Unification; Image (mathematics); Semantic similarity; Similarity (network science); Embedding; Labeled data; Word (computer architecture)
Source:	Proceedings of the Second Workshop on Shortcomings in Vision and Language.
DOI:	10.18653/v1/w19-1807
Description:	There has been significant interest recently in learning multilingual word embeddings, in which semantically similar words across languages have similar embeddings. State-of-the-art approaches have relied on expensive labeled data, which is unavailable for low-resource languages, or have involved post-hoc unification of monolingual embeddings. In the present paper, we investigate the efficacy of multilingual embeddings learned from weakly supervised image-text data. In particular, we propose methods for learning multilingual embeddings from image-text data by enforcing similarity between the representation of the image and that of the text. Our experiments reveal that, even without using any expensive labeled data, a bag-of-words-based embedding model trained on image-text data achieves performance comparable to the state of the art on crosslingual semantic similarity tasks.
Database:	OpenAIRE
External links:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::1f3cd44387a9e83a9c7d60feec049c27 https://doi.org/10.18653/v1/w19-1807