Abstract:
Speech emotion recognition (SER) is a vital component of human–computer interaction systems. Traditional deep learning-based SER schemes suffer from poor time-domain representation, class imbalance caused by uneven sample counts in the training datasets, limited feature distinctiveness, and weak modeling of long-term dependencies among the global and local attributes of speech. This article introduces a lightweight long short-term memory (LSTM) network combined with multiple acoustic features, namely Mel-frequency cepstral coefficients (MFCC), chroma, root mean square (RMS), and Tonnetz features, to strengthen the time-domain representation and long-term dependency modeling of emotional speech. It further applies data augmentation techniques such as noise addition, shifting, and stretching, followed by feature extraction, to reduce the class imbalance problem. The proposed method achieves average accuracies of 95% on SAVEE, 94.57% on EMOVO, 97.16% on EMODB, and 96.66% on BAVED, a noteworthy improvement over traditional state-of-the-art approaches.
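To make the feature and augmentation pipeline named in the abstract concrete, the following is a minimal sketch, not the authors' exact implementation: it extracts the four acoustic feature families (MFCC, chroma, RMS, Tonnetz) with librosa and produces the three waveform-level augmentations (noise addition, shifting, stretching). The file path, feature dimensions, and augmentation parameters are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_features(y, sr):
    """Concatenate time-averaged MFCC, chroma, RMS, and Tonnetz features into one vector."""
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    rms = np.mean(librosa.feature.rms(y=y), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, rms, tonnetz])

def augment(y, sr):
    """Yield noise-added, time-shifted, and time-stretched variants of a waveform."""
    yield y + 0.005 * np.random.randn(len(y))         # noise addition
    yield np.roll(y, int(0.1 * sr))                   # shifting by 0.1 s
    yield librosa.effects.time_stretch(y, rate=0.9)   # stretching

# Usage: one feature vector per original and augmented utterance
# ("utterance.wav" is a hypothetical path).
y, sr = librosa.load("utterance.wav", sr=22050)
vectors = [extract_features(y, sr)] + [extract_features(a, sr) for a in augment(y, sr)]
```

These fixed-length vectors could then be fed to an LSTM-based classifier; the exact network architecture and training setup are described in the body of the paper rather than the abstract.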