Description: |
Since the advent of deep learning, automatic speech recognition (ASR), like many other fields, has advanced significantly. Both the acoustic model and the language model are now based on artificial neural networks, which has yielded drastic improvements in recognition accuracy. However, whereas state-of-the-art acoustic models have been integrated directly into the core of most, if not all, speech recognizers, the same cannot be said of language models. Current speech recognizers still employ n-gram language models, models that were developed during the 1970s, because they achieve reasonable accuracy and, more importantly, because they are extremely efficient. More advanced language models, on the other hand, such as the current state-of-the-art recurrent neural network language models (RNNLMs), are computationally expensive and can therefore only be applied in multi-pass recognition: the model is used in a second pass to rescore the output of a first pass that uses n-grams. This is suboptimal for both speed and accuracy and indicates a clear demand for alternatives. In this work, we propose five such alternatives that improve upon regular n-grams while striving for minimal computational complexity.

First, we improve the prediction accuracy of n-gram language models without sacrificing their efficiency. To this end, we propose a class-based n-gram language model that uses compound-head clusters as classes. We argue that compounds are well represented by their head, which alleviates the overgeneralization that class-based models usually suffer from. We present a clustering algorithm that detects the head of a compound with high precision, and we use aggregated statistics to model both unseen and infrequent compounds. Our technique is validated experimentally on the Dutch CGN corpus and shows significant word error rate reductions compared to regular n-gram models.

In a second proposal, we overcome the inability of n-grams to capture long-distance relations between words by combining them with semantic language models, i.e. models that are able to detect semantic similarities between words. We conduct a thorough investigation of two existing semantic language models, namely cache models and models based on Latent Semantic Analysis (LSA), and compare them to a novel semantic model that is based on word embeddings. Not only does our proposed model consistently achieve higher prediction accuracy on Dutch newspaper and magazine data, it is also twice as fast as the model based on LSA and combines well with cache models.

Another approach to modeling long-distance dependencies is proposed in the form of a novel estimation technique called Sparse Non-negative Matrix (SNM) estimation. This technique is able to incorporate arbitrary features, yet scales gracefully to large amounts of data. We show that SNM language models trained with n-gram features are a close match for the well-established Kneser-Ney models, and that the addition of skip-gram features yields a model that is in the same league as the state-of-the-art RNNLM, as well as complementary to it. The model is validated experimentally and shows excellent results on two different data sets: Google's One Billion Word Benchmark and a smaller subset of the LDC Gigaword corpus. Moreover, we show that a first implementation is already 10x faster than a dedicated implementation of an RNNLM.
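To make the feature-combination idea concrete, the sketch below shows a generic sparse, non-negative feature-based language model: a word is scored by summing non-negative weights over the n-gram and skip-gram features that fire for the current history, then normalizing over the vocabulary. This is only a minimal illustration of this family of models under assumed names (SparseFeatureLM, context_features) and a toy feature set; it is not the SNM estimation procedure described in the thesis.

  from collections import defaultdict

  # Minimal sketch of a sparse, non-negative feature-based language model.
  # A word is scored by summing non-negative weights over the context
  # features (n-gram and skip-gram style) active for the current history.
  # Illustration only, not the SNM estimator from the thesis.

  def context_features(history):
      """Return the context features that are active for a word history."""
      feats = []
      if len(history) >= 1:
          feats.append(("bigram", history[-1]))
      if len(history) >= 2:
          feats.append(("trigram", history[-2], history[-1]))
          feats.append(("skip-1", history[-2]))  # skip-gram: previous word skipped
      return feats

  class SparseFeatureLM:
      def __init__(self):
          # weights[feature][word] >= 0, stored sparsely (only seen pairs)
          self.weights = defaultdict(lambda: defaultdict(float))

      def add_weight(self, feature, word, value):
          assert value >= 0.0, "weights must stay non-negative"
          self.weights[feature][word] += value

      def prob(self, word, history, vocab):
          """P(word | history) as a normalized sum of feature weights."""
          feats = context_features(history)
          score = sum(self.weights[f][word] for f in feats)
          norm = sum(sum(self.weights[f][w] for w in vocab) for f in feats)
          if norm == 0.0:
              return 1.0 / len(vocab)  # uniform fallback for unseen contexts
          return score / norm

  # Toy usage with a tiny vocabulary and hand-set feature weights.
  if __name__ == "__main__":
      lm = SparseFeatureLM()
      vocab = ["the", "cat", "sat", "mat"]
      lm.add_weight(("bigram", "the"), "cat", 2.0)
      lm.add_weight(("skip-1", "sat"), "mat", 1.0)
      print(lm.prob("cat", ["sat", "the"], vocab))

In practice the weights would be estimated from corpus counts rather than set by hand; the point of the sketch is only that arbitrary, overlapping context features can be combined in a single additive, non-negative score.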
Efficient language modeling can also be achieved by adapting the recognizer so that it can spend more resources on the language model. To this end, we propose a layered architecture that uses the output of a first acoustic layer as input for a second word decoding layer. This decoupling alleviates the task of the decoder, which makes it possible to apply more complex language models. We show on the Dutch N-Best benchmark that, although we have not exploited its full potential, the architecture is already competitive with an all-in-one approach in which the acoustic model, language model and lexicon are all applied simultaneously.

Finally, we propose a novel language model adaptation technique that can be applied to ASR of spoken translations. The technique consists of n-gram probability inflation using exponential weights based on translation model probabilities, which reduces the number of updates. It does not enforce probability renormalization and reduces data storage and memory load by storing only the update weights. We validate this technique experimentally and show that it achieves a significant word error rate reduction on spoken Dutch translations from English, while having little to no negative effect on recognition time, which allows its use in a real-time computer-aided translation environment.

Pelemans J., ''Efficient language modeling for automatic speech recognition'', dissertation presented to obtain the degree of Doctor of Engineering Science, KU Leuven, May 2017, Leuven, Belgium.
status: published