Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition
Authors: | Freha Mezzoudj, Denis Jouvet, David Langlois, Abdelkader Benyettou |
---|---|
Year of publication: | 2018 |
Subject: | Perplexity; Computer science; Speech recognition; 05 social sciences; Context (language use); 02 engineering and technology; Domain (software engineering); 0202 electrical engineering, electronic engineering, information engineering; Selection (linguistics); General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Language model; 0509 other social sciences; Transcription (software); 050904 information & library sciences; Scope (computer science); Natural language; General Environmental Science |
Source: | Procedia Computer Science. 128:55-64 |
ISSN: | 1877-0509 |
DOI: | 10.1016/j.procs.2018.03.008 |
Description: | The language model is an important module in many applications that produce natural language text, in particular speech recognition. Training language models requires large amounts of textual data that match the target domain. Selection of target domain (or in-domain) data has been investigated in the past. For example, [1] proposed a criterion based on the difference of cross-entropy between models representing in-domain and non-domain-specific data. However, evaluations were conducted using only two sources of data: one corresponding to the in-domain data, and another to generic data from which sentences are selected. In the scope of broadcast news and TV show transcription systems, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models for representing in-domain and non-domain-specific data is critical. Moreover, it is better to apply the data selection only to some selected data sources. This way, the selection process leads to an improvement of 8.3 in terms of perplexity and 0.2% in terms of word error rate on the French broadcast transcription task. |
Database: | OpenAIRE |
External link: |
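
The cross-entropy difference criterion that the description attributes to [1] can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say which language models were used, so add-one-smoothed unigram models stand in here as an assumption, and the names `train_unigram`, `cross_entropy`, and `select_sentences`, as well as the threshold of 0.0, are hypothetical. A candidate sentence is kept when its per-word cross-entropy under the in-domain model is lower than under the generic model.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words in a list of sentences (assumed stand-in for a real LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy (bits) of a sentence under an add-one-smoothed unigram model."""
    counts, total = model
    words = sentence.split()
    if not words:
        return 0.0
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)


def select_sentences(candidates, in_domain, generic, threshold=0.0):
    """Keep candidates whose score H_in(s) - H_generic(s) falls below the threshold."""
    # Shared vocabulary so that both smoothed models are comparable.
    vocab = {w for s in in_domain + generic + candidates for w in s.split()}
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(generic)
    selected = []
    for s in candidates:
        score = cross_entropy(s, in_model, len(vocab)) - cross_entropy(
            s, gen_model, len(vocab)
        )
        if score < threshold:
            selected.append(s)
    return selected


# Toy data, invented purely for illustration.
in_domain = ["the minister announced the budget", "the president met the minister"]
generic = ["click here to buy cheap shoes", "buy cheap tickets online here"]
candidates = ["the minister met the president", "buy cheap shoes here"]

print(select_sentences(candidates, in_domain, generic))
```

In the setting the description outlines, such a selection step would presumably be run on individual data sources before their language models are interpolated; how the threshold is tuned is not stated in this record.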