Arabic Broadcast News Transcription Using a One Million Word Vocalized Vocabulary

Authors:	Jean-Luc Gauvain, Lori Lamel, Abdelkhalek Messaoudi
Publication year:	2006
Subject:	Vocabulary Arabic Computer science business.industry Speech recognition media_common.quotation_subject computer.software_genre Variety (linguistics) language.human_language Vowel language Language model Artificial intelligence Transcription (software) business computer Word (computer architecture) Natural language processing Natural language media_common
Source:	ICASSP (1)
DOI:	10.1109/icassp.2006.1660215
Description:	Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vocalized, the training methods have to overcome this missing-data problem. For the acoustic models, the procedure was bootstrapped with manually vocalized data and extended with semi-automatically vocalized data. To also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vocalized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k-word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem, a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and the vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vocalized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved. (Author footnote: visiting scientist from the Vecsys Company.)
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::4d4bdb38f62083bffa4aa4e4ff1ef686 https://doi.org/10.1109/icassp.2006.1660215
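
Illustration (not part of the original record): the description refers to two quantities that are easy to state in code, the out-of-vocabulary (OOV) rate and the interpolation of two language models. The sketch below is a minimal, generic version of both, assuming whitespace-tokenized text and plain linear interpolation with a single weight. The function names, the toy tokens, and the weight 0.5 are hypothetical; the record does not say which weights the authors used or how vocalized forms were mapped to non-vocalized ones.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of running tokens that are absent from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


def interpolate(p_first, p_second, weight):
    """Linearly interpolate two model probabilities for the same event.

    `weight` is the share given to the first model (an assumed free
    parameter, normally tuned on held-out data).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_first + (1.0 - weight) * p_second


# Toy data, purely hypothetical: a larger vocabulary lowers the OOV rate.
tokens = "w1 w2 w3 w2 w5".split()
small_vocab = {"w1", "w2"}
large_vocab = {"w1", "w2", "w3"}

small_rate = oov_rate(tokens, small_vocab)   # 2 of 5 tokens are unknown
large_rate = oov_rate(tokens, large_vocab)   # 1 of 5 tokens is unknown

# Combine the probability that two hypothetical models assign to one word.
mixed = interpolate(0.2, 0.1, weight=0.5)
```

The OOV rate here is computed over running tokens rather than distinct word types, which is the usual convention for figures such as the 4-8% quoted in the description; that reading is an assumption, since the record does not spell it out.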