RMWSaug: Robust Multi-window Spectrogram Augmentation Approach for Deep Learning based Speech Emotion Recognition

Author:	Shehu Mohammed Yusuf, Emmanuel Adewale Adedokun, M. B. Mua’zu, Ime J. Umoh, Ahmed Abdul Ibrahim
Year of publication:	2021
Subject:	Scheme (programming language); Computer science; Speech recognition; Deep learning; Motion capture; Benchmark (computing); Spectrogram; Artificial intelligence; Noise (video); Transfer of learning; Environmental noise
Source:	2021 Innovations in Intelligent Systems and Applications Conference (ASYU).
DOI:	10.1109/asyu52992.2021.9598956
Description:	Data scarcity and speech degradation due to environmental noise are two significant issues in the modelling and deployment of speech emotion recognition (SER) systems. Deep learning-based SER systems overfit during modelling because training samples are scarce. Although recent attempts to tackle both issues simultaneously using data augmentation have yielded promising results, they are not robust enough to handle speech degradation due to real environmental noise. Thus, there is a need to further improve the classification performance of deployed SER systems. This work proposes an SER system based on a novel robust multi-window spectrogram augmentation (RMWSaug) scheme and transfer learning to handle these issues simultaneously. First, the RMWSaug scheme applies multi-window and multi-noise conditioning to clean speech samples to create the additional speech spectrograms required for training. Then, pretrained networks are adapted for speech emotion recognition and fine-tuned with the generated training datasets to develop a model that is robust to speech degradation due to noise, thereby improving classification performance in the wild. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database was selected as the benchmark dataset for evaluating the proposed SER system. Experimental results show that the proposed SER system outperformed existing methods when deployed in the wild. The proposed SER system can be deployed to predict the emotions of speakers conversing virtually on online platforms.
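The multi-window, multi-noise conditioning described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the record does not give the window lengths, noise sources, or signal-to-noise ratios used, so the values below (`win_lens`, `snrs_db`, a Hann window, 50% overlap, log-magnitude scaling) and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled to a target SNR in dB."""
    noise = np.resize(noise, clean.shape)  # repeat or truncate noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def log_spectrogram(signal, win_len, hop):
    """Log-magnitude short-time Fourier transform with a Hann window."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T  # (freq_bins, frames)

def augment(clean, noises, win_lens=(256, 512, 1024), snrs_db=(0, 10, 20)):
    """Return spectrograms of one utterance for every window length (assumed
    values), both clean and mixed with every noise at every SNR (assumed values)."""
    specs = []
    for win_len in win_lens:
        hop = win_len // 2  # assumed 50% overlap
        specs.append(log_spectrogram(clean, win_len, hop))
        for noise in noises:
            for snr_db in snrs_db:
                noisy = mix_at_snr(clean, noise, snr_db)
                specs.append(log_spectrogram(noisy, win_len, hop))
    return specs
```

With three window lengths, one noise recording, and three SNR levels, each clean utterance yields 12 spectrograms (3 clean plus 9 noisy), which is how a scheme of this kind enlarges a small training set while exposing the model to degraded speech.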
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::798513a70c72a531602176667c704945 https://doi.org/10.1109/asyu52992.2021.9598956