Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian
Author: | Sami Virpioja, Mikko Kurimo, Matti Varjokallio |
---|---|
Contributors: | Dept. of Signal Processing and Acoustics; University of Helsinki; Centre of Excellence in Computational Inference (COIN); Aalto University (Aalto-yliopisto); Language Technology |
Language: | English |
Year of publication: | 2021 |
Subject: |
Vocabulary; Language model; Language modelling; Class-based language models; Bigram; Morphologically rich languages; Inflection; Estonian; Natural language processing; Theoretical Computer Science; Human-Computer Interaction |
Description: | We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models. |
Database: | OpenAIRE |
External link: |
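The exchange procedure mentioned in the abstract can be illustrated with a toy sketch. This is a generic two-sided class-bigram exchange in the style of classic word clustering for class-based language models, not the authors' actual constrained merge/split/exchange implementation; all function names and the initialization are illustrative assumptions. Each word is greedily moved to whichever class maximizes the corpus log-likelihood under the model P(w_i | w_{i-1}) = P(c_i | c_{i-1}) · P(w_i | c_i).

```python
import math
from collections import Counter

def class_bigram_ll(corpus, assign):
    """Corpus log-likelihood under a class bigram model:
    P(w_i | w_{i-1}) = P(class_i | class_{i-1}) * P(w_i | class_i)."""
    cls = [assign[w] for w in corpus]
    bigrams = Counter(zip(cls, cls[1:]))   # class bigram counts
    hist = Counter(cls[:-1])               # class counts as bigram histories
    c_count = Counter(cls)                 # class unigram (token) counts
    w_count = Counter(corpus)              # word unigram counts
    ll = 0.0
    for (c1, c2), n in bigrams.items():    # class transition term
        ll += n * math.log(n / hist[c1])
    for w, c in assign.items():            # class-conditional emission term
        if w_count[w]:
            ll += w_count[w] * math.log(w_count[w] / c_count[c])
    return ll

def exchange(corpus, assign, n_iter=5):
    """Greedy exchange: repeatedly move each word to the class that
    maximizes the log-likelihood, until no move helps.
    In the paper's setting, `assign` would be seeded from the output of
    a morphological analyzer rather than arbitrarily."""
    classes = sorted(set(assign.values()))
    for _ in range(n_iter):
        moved = False
        for w in sorted(set(corpus)):
            orig = assign[w]
            best_c, best_ll = orig, None
            for c in classes:              # evaluate every candidate class
                assign[w] = c
                ll = class_bigram_ll(corpus, assign)
                if best_ll is None or ll > best_ll:
                    best_c, best_ll = c, ll
            assign[w] = best_c             # keep the best move
            moved = moved or (best_c != orig)
        if not moved:
            break
    return assign

# Tiny usage example on an artificial corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
assign = {w: i % 3 for i, w in enumerate(sorted(set(corpus)))}
assign = exchange(corpus, assign)
```

Recomputing the full log-likelihood per candidate move keeps the sketch short but is quadratic in practice; real implementations update the affected count terms incrementally, and the paper additionally constrains the moves so that the refined classes stay consistent with the morphological classes.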