Data Selection with Feature Decay Algorithms Using an Approximated Target Side

Author: Way, Andy; Poncelas, Alberto; Maillette de Buy Wenniger, Gideon
Contributors: Turchi, Marco; Niehues, Jan; Federico, Marcello
Language: English
Year of publication: 2018
Subject:
Source: Poncelas, Alberto ORCID: 0000-0002-5089-1687 , Maillette de Buy Wenniger, Gideon and Way, Andy ORCID: 0000-0001-5736-5930 (2018) Data selection with feature decay algorithms using an approximated target side. In: The 15th International Workshop on Spoken Language Translation 2018, 29-30 Oct 2018, Bruges, Belgium.
Way, Andy ORCID: 0000-0001-5736-5930 , Poncelas, Alberto ORCID: 0000-0002-5089-1687 and Maillette de Buy Wenniger, Gideon (2018) Data selection with feature decay algorithms using an approximated target side. In: 15th International Workshop on Spoken Language Translation (IWSLT 2018), 29-30 Oct 2018, Bruges, Belgium.
Description: Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One family of data selection techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. In order to try to fix this problem, in this paper we propose to use an approximated target side in addition to the source side when selecting suitable sentence pairs for training a model. This approximated target side is built by pre-translating the source side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.
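The selection idea in the description can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified feature-decay rule (each seed n-gram is worth `decay ** count`, where `count` is how often it already occurs in the selected data, and scores are normalized by sentence length), and the names `ngrams`, `fda_select`, `mix_selections`, and the toy sentences are hypothetical. The mixing step simply takes the top items from each ranking, which is one assumed way to combine the two FDA outputs.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return the set of all n-grams (as tuples) up to length max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def fda_select(seed_sentences, candidates, k, side=0, decay=0.5, max_n=3):
    """Greedily pick k candidate pairs whose chosen side overlaps the seed.

    candidates: list of (source, target) sentence pairs.
    side: 0 to match on the source side, 1 to match on the target side.
    Returns the indices of the selected pairs, in selection order.
    """
    seed_features = set()
    for sentence in seed_sentences:
        seed_features |= ngrams(sentence.split(), max_n)

    # Features of each candidate that also occur in the seed.
    cand_tokens = [pair[side].split() for pair in candidates]
    cand_feats = [ngrams(toks, max_n) & seed_features for toks in cand_tokens]

    counts = Counter()  # how often each feature was already selected
    remaining = list(range(len(candidates)))
    selected = []

    def score(idx):
        if not cand_tokens[idx]:
            return 0.0
        total = sum(decay ** counts[f] for f in cand_feats[idx])
        return total / len(cand_tokens[idx])

    while remaining and len(selected) < k:
        best = max(remaining, key=score)  # first index wins on ties
        remaining.remove(best)
        selected.append(best)
        for f in cand_feats[best]:
            counts[f] += 1  # decay the value of features now covered
    return selected


def mix_selections(src_selected, tgt_selected, n_src, n_tgt):
    """Take the top n_src source-side picks, then up to n_tgt new target-side picks."""
    mixed = list(src_selected[:n_src])
    added = 0
    for idx in tgt_selected:
        if added == n_tgt:
            break
        if idx not in mixed:
            mixed.append(idx)
            added += 1
    return mixed


# Toy German-English data (hypothetical).
candidates = [
    ("das haus ist klein", "the house is small"),
    ("ich mag katzen", "i like cats"),
    ("das haus ist groß", "the home is large"),   # source matches, target is loose
    ("das gebäude ist groß", "the house is big"),
]
test_source = ["das haus ist groß"]
approx_target = ["the house is big"]  # stands in for a pre-translation of the test set

src_sel = fda_select(test_source, candidates, k=2, side=0)
tgt_sel = fda_select(approx_target, candidates, k=2, side=1)
mixed = mix_selections(src_sel, tgt_sel, n_src=1, n_tgt=1)
```

In this toy setup the source-side ranking prefers the pair whose German side matches the test sentence even though its English side is a loose rendering, while the target-side ranking prefers the pair whose English side matches the pre-translation; the mixture keeps one of each.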
Database: OpenAIRE