Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Authors: Darko Pekar, Branislav Popović, Edvin Pakoci
Year of publication: 2019
Subject:
Vocabulary
Article Subject
General Computer Science
Computer science
General Mathematics
media_common.quotation_subject
02 engineering and technology
lcsh:Computer applications to medicine. Medical informatics
computer.software_genre
Semantics
lcsh:RC321-571
03 medical and health sciences
0302 clinical medicine
0202 electrical engineering, electronic engineering, information engineering
Humans
Speech
lcsh:Neurosciences. Biological psychiatry. Neuropsychiatry
Lemma (morphology)
media_common
Dictation
business.industry
General Neuroscience
Recognition
Psychology
General Medicine
language.human_language
Grammatical number
Speech Perception
language
lcsh:R858-859.7
020201 artificial intelligence & image processing
Artificial intelligence
Language model
Serbian
business
computer
030217 neurology & neurosurgery
Word (computer architecture)
Natural language processing
Research Article
Source: Computational Intelligence and Neuroscience
Computational Intelligence and Neuroscience, Vol 2019 (2019)
ISSN: 1687-5273
1687-5265
DOI: 10.1155/2019/5072918
Description: Serbian belongs to a group of highly inflected, morphologically rich languages that use many different word suffixes to express grammatical, syntactic, or semantic features. This behaviour typically produces many recognition errors, especially in large vocabulary systems: even when good acoustic matching lets the automatic speech recognition system predict the correct lemma, the word ending is often wrong, and the result is still counted as an error. The effect is larger for contexts not present in the language model training corpus. This manuscript examines an approach to language modeling that takes into account different morphological categories of words, and presents the benefits in terms of word error rate and perplexity. The categories include word type, word case, grammatical number, and gender, and all of them were assigned to words in the system vocabulary, where applicable. These additional word features produced significant improvements over the baseline system for both n-gram-based and neural network-based language models. The proposed system can help overcome many tedious errors in large vocabulary systems, for example in dictation, both for Serbian and for other languages with similar characteristics.
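To make the idea of using morphological categories in a language model more concrete, here is a minimal sketch of a class-based bigram model, a standard technique that may differ from the models actually used in the article. The lexicon, the tag format (`type.case.number.gender`), the toy corpus, and all function names are assumptions made for illustration only. The model factors P(word | previous word) as P(tag | previous tag) × P(word | tag), so a word pair never seen in training can still receive probability mass when its morphological tags agree.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> morphological tag (type.case.number.gender).
LEXICON = {
    "lepa": "ADJ.nom.sg.f",
    "velika": "ADJ.nom.sg.f",
    "veliki": "ADJ.nom.sg.m",
    "kuća": "NOUN.nom.sg.f",
    "žena": "NOUN.nom.sg.f",
    "grad": "NOUN.nom.sg.m",
}

# Toy training corpus (illustrative only, not data from the article).
CORPUS = [["lepa", "kuća"], ["velika", "žena"], ["veliki", "grad"]]


def train(corpus, lexicon):
    """Collect the counts needed for an unsmoothed class-based bigram model."""
    tag_bigrams = Counter()   # (previous tag, tag) pairs
    tag_history = Counter()   # tags that occur as a bigram history
    tag_counts = Counter()    # all tag occurrences
    word_counts = Counter()   # all word occurrences
    for sentence in corpus:
        tags = [lexicon[w] for w in sentence]
        word_counts.update(sentence)
        tag_counts.update(tags)
        tag_history.update(tags[:-1])
        tag_bigrams.update(zip(tags, tags[1:]))
    return tag_bigrams, tag_history, tag_counts, word_counts


def class_bigram_prob(prev, word, lexicon, model):
    """P(word | prev) ~= P(tag(word) | tag(prev)) * P(word | tag(word))."""
    tag_bigrams, tag_history, tag_counts, word_counts = model
    prev_tag, tag = lexicon[prev], lexicon[word]
    if tag_history[prev_tag] == 0 or tag_counts[tag] == 0:
        return 0.0  # no smoothing in this sketch
    p_tag = tag_bigrams[(prev_tag, tag)] / tag_history[prev_tag]
    p_word = word_counts[word] / tag_counts[tag]
    return p_tag * p_word


MODEL = train(CORPUS, LEXICON)

# "lepa žena" never occurs in CORPUS, but its tags agree in gender.
p_agree = class_bigram_prob("lepa", "žena", LEXICON, MODEL)
# "lepa grad" mixes a feminine adjective with a masculine noun.
p_clash = class_bigram_prob("lepa", "grad", LEXICON, MODEL)
print(p_agree, p_clash)
```

In this toy setting the unseen but grammatical pair gets a non-zero score, while the pair with mismatched gender gets zero. A practical system would add smoothing and would typically combine such class information with word-level n-gram or neural models.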
Database: OpenAIRE