Bilingual Data Selection Using a Continuous Vector-Space Representation

Authors: Mara Chinea-Rios, Francisco Casacuberta, Germán Sanchis-Trilles
Publication year: 2016
Subject:
Source: Lecture Notes in Computer Science ISBN: 9783319490540
S+SSPR
DOI: 10.1007/978-3-319-49055-7_9
Description: Data selection aims to select the best subset of an available pool of sentences with which to train a pattern recognition system. In this article, we present a bilingual data selection method that uses a continuous vector-space representation of word sequences to select the best subset of a bilingual corpus for training a machine translation system. We compared our proposal with a state-of-the-art data selection technique (cross-entropy) and obtained very promising results, which were consistent across different language pairs.
Database: OpenAIRE