Bilingual Data Selection Using a Continuous Vector-Space Representation
Authors: | Mara Chinea-Rios, Francisco Casacuberta, Germán Sanchis-Trilles |
---|---|
Year of publication: | 2016 |
Subject: | Computer science, Representation (systemics), Pattern recognition, Pattern recognition system, Bilingual corpus, Artificial intelligence, Machine translation system, Vector space representation, Data selection, Natural language processing, Word (computer architecture) |
Source: | Lecture Notes in Computer Science; ISBN: 9783319490540; S+SSPR |
DOI: | 10.1007/978-3-319-49055-7_9 |
Description: | Data selection aims to select the best subset of data from an available pool of sentences with which to train a pattern recognition system. In this article, we present a bilingual data selection method that uses a continuous vector-space representation of word sequences to select the best subset of a bilingual corpus for training a machine translation system. We compared our proposal with a state-of-the-art data selection technique (cross-entropy) and obtained very promising results, which were consistent across different language pairs. |
Database: | OpenAIRE |
External link: |