Bilingual Data Selection Using a Continuous Vector-Space Representation

Authors:	Mara Chinea-Rios, Francisco Casacuberta, Germán Sanchis-Trilles
Year of publication:	2016
Subject:	business.industry; Computer science; Representation (systemics); Pattern recognition; Pattern recognition system; computer.software_genre; Bilingual corpus; Artificial intelligence; Machine translation system; Vector space representation; business; computer; Data selection; Natural language processing; Word (computer architecture)
Source:	Lecture Notes in Computer Science; ISBN: 9783319490540; S+SSPR
DOI:	10.1007/978-3-319-49055-7_9
Description:	Data selection aims to select the best data subset from an available pool of sentences with which to train a pattern recognition system. In this article, we present a bilingual data selection method that leverages a continuous vector-space representation of word sequences for selecting the best subset of a bilingual corpus, for the application of training a machine translation system. We compared our proposal with a state-of-the-art data selection technique (cross-entropy), obtaining very promising results that were consistent across different language pairs.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::3873b7f592d978ee469a61d36efa619c https://doi.org/10.1007/978-3-319-49055-7_9