Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Authors:	Sarala Padi, Ram D. Sriram, Seyed Omid Sadjadi, Dinesh Manocha
Publication year:	2021
Subject:	FOS: Computer and information sciences; Sound (cs.SD); Computer Science - Artificial Intelligence; business.industry; Computer science; Speech recognition; Emotion classification; Deep learning; Pooling; Computer Science - Human-Computer Interaction; Overfitting; Speaker recognition; Motion capture; Computer Science - Sound; Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Spectrogram; Artificial intelligence; Transfer of learning; business; Electrical Engineering and Systems Science - Audio and Speech Processing
Source:	ICMI
DOI:	10.1145/3462244.3481003
Description:	Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model, including a statistics pooling layer, from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation, which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique that generates additional training samples by applying random time-frequency masks to log-mel spectrograms, in order to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results. Comment: Accepted at ACM/SIGCHI ICMI'21
Database:	OpenAIRE
External links:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9355688b15111f058a2128f25c192a66 https://doi.org/10.1145/3462244.3481003
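
Illustration (not part of the catalog record): the description mentions two mechanisms, random time-frequency masking of log-mel spectrograms and a statistics pooling layer that accepts variable-length input. The NumPy sketch below shows the general idea of each under stated assumptions. The function names, the mask widths, the single mask per axis, and the zero fill value are hypothetical choices for illustration and are not taken from the paper; the pooling shown is the common mean-plus-standard-deviation form, which may differ in detail from the authors' layer.

```python
import numpy as np


def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=20, fill=0.0):
    """Return a copy of `spec` (n_mels, n_frames) with one random frequency
    band and one random time band overwritten by `fill`.

    Widths and fill value are illustrative assumptions, not the paper's settings.
    """
    out = spec.copy()
    n_mels, n_frames = out.shape

    # Frequency mask: choose a width in [0, max_freq_width], then a start row.
    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: choose a width in [0, max_time_width], then a start frame.
    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill
    return out


def statistics_pooling(frames):
    """Collapse frame-level features (n_features, n_frames) into one
    fixed-size vector of per-feature mean and standard deviation over time."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])


rng = np.random.default_rng(0)
short_utt = rng.standard_normal((40, 120))   # stand-in for a 120-frame log-mel input
long_utt = rng.standard_normal((40, 300))    # stand-in for a 300-frame log-mel input

augmented = mask_spectrogram(short_utt, rng)
pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)
```

Because the pooled vector has length 2 × n_features regardless of the number of frames, utterances of different durations map to the same output size, which is why no truncation to a fixed length is needed.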