Vector sentences representation for data selection in statistical machine translation

Author:	Francisco Casacuberta, Mara Chinea-Rios, Germán Sanchis-Trilles
Language:	English
Year of publication:	2019
Subject:	Statistical machine translation; Computer science; Continuous vector-space representation; 02 engineering and technology; computer.software_genre; Translation (geometry); 01 natural sciences; Theoretical Computer Science; Data selection; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Cross-entropy; 010301 acoustics; Infrequent ngrams recovery; business.industry; Representation (systemics); 020206 networking & telecommunications; Human-Computer Interaction; Artificial intelligence; business; computer; LENGUAJES Y SISTEMAS INFORMATICOS; Software; Natural language processing
Source:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Description:	[EN] One of the most popular approaches to machine translation consists in formulating the task as a pattern recognition problem. Under this perspective, bilingual corpora are precious resources, as they allow for a proper estimation of the underlying models. In this framework, selecting the best possible corpus is critical, and data selection aims to find the subset of bilingual sentences, drawn from an available pool, that most improves the final translation quality. In this paper, we present a new data selection technique that leverages a continuous vector-space representation of sentences. Experimental results show improvements not only over a system trained only on in-domain data, but also over a system trained on all the available data. Finally, we compared our proposal with other state-of-the-art data selection techniques (Cross-entropy selection and Infrequent ngrams recovery) in two different scenarios, obtaining very promising results: our data selection strategy yields results that are at least as good as the best-performing strategy in each scenario. The empirical results reported are consistent across different language pairs. Work supported by the Generalitat Valenciana under grant ALMAMATER (PrometeoII/2014/030) and by the FPI (2014) grant of the Universitat Politècnica de València.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d647e6e0e73fba0da9d49a4e0cfeb7ed http://hdl.handle.net/10251/155404