Description: |
Recent advances in end-to-end speech recognition have made it possible to build multilingual models capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of the languages. However, in some cases, these models rely on perfect knowledge of the language being spoken; that is, they expect an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using an RNN-T (recurrent neural network transducer), and a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in both training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages: one consisting of different dialects of Arabic, and the other consisting of Nordic languages together with Finnish and Dutch. |