Fusion Approaches for Emotion Recognition from Speech Using Acoustic and Text-Based Features

Authors:	Luciana Ferrer, Leonardo Pepino, Agustín Gravano, Pablo Riera
Year of publication:	2020
Subject:	Fusion; Computer science; business.industry; Speech recognition; Deep learning; 02 engineering and technology; 010501 environmental sciences; 01 natural sciences; Transcription (linguistics); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Emotion recognition; Artificial intelligence; business; Word (computer architecture); 0105 earth and related environmental sciences
Source:	ICASSP
DOI:	10.1109/icassp40776.2020.9054709
Description:	In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimate of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::ba3d5855ee2bdc7fa8f3a6499b7ebf94 https://doi.org/10.1109/icassp40776.2020.9054709