Showing 1 - 10 of 67 for search: '"Virpioja, Sami"'
This paper introduces Bayesian uncertainty modeling using Stochastic Weight Averaging-Gaussian (SWAG) in Natural Language Understanding (NLU) tasks. We apply the approach to standard tasks in natural language inference (NLI) and demonstrate the effec
External link:
http://arxiv.org/abs/2304.04726
Authors:
Tiedemann, Jörg, Aulamo, Mikko, Bakshandaeva, Daria, Boggia, Michele, Grönroos, Stig-Arne, Nieminen, Tommi, Raganato, Alessandro, Scherrer, Yves, Vazquez, Raul, Virpioja, Sami
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission o
External link:
http://arxiv.org/abs/2212.01936
Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing
External link:
http://arxiv.org/abs/2008.08315
Character-based Neural Network Language Models (NNLM) have the advantage of smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both the character and multi-char
External link:
http://arxiv.org/abs/2007.11648
In spoken Keyword Search, the query may contain out-of-vocabulary (OOV) words not observed when training the speech recognition system. Using subword language models (LMs) in the first-pass recognition makes it possible to recognize the OOV words, bu
External link:
http://arxiv.org/abs/2005.13827
There are several approaches for improving neural machine translation for low-resource languages: Monolingual data can be exploited via pretraining or data augmentation; Parallel corpora on related language pairs can be used via parameter sharing or
External link:
http://arxiv.org/abs/2004.04002
Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely ad
External link:
http://arxiv.org/abs/2003.03131
Authors:
Talman, Aarne, Sulubacak, Umut, Vázquez, Raúl, Scherrer, Yves, Virpioja, Sami, Raganato, Alessandro, Hurskainen, Arvi, Tiedemann, Jörg
In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English. This year, we focused first on cleaning and filtering the t
External link:
http://arxiv.org/abs/1906.04040
This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The
External link:
http://arxiv.org/abs/1808.10791
Published in:
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2085-2097, November 2017
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usabi
External link:
http://arxiv.org/abs/1707.04227