Description: |
Human-computer interaction is developing at a remarkably fast pace. To keep up with this development, one of the many capabilities that must advance is a machine's ability to recognize human emotions from speech, known as Speech Emotion Recognition (SER). Studies on SER have drawn on a variety of data sources, such as TV shows, movies, and recordings of actors' voices. Although the results have been satisfactory, collecting TV and actor recordings can be difficult and costly. YouTube, by contrast, is an open and free platform from which data can be retrieved with little effort, yet almost no SER study has collected data this way. This paper presents SER for the Indonesian language, using a dataset of Indonesian YouTube web series with four emotion labels. We first carried out several experiments to determine which deep learning approach, trained on which combination of features, yields the best result. This initial stage showed that a Convolutional Neural Network (CNN) using a combination of MFCC, Contrast, and Tonnetz features performs better than the other deep learning approaches we tested. After parameter tuning, the CNN with the combination of MFCC, Contrast, and Tonnetz features achieves an F1-score of 62.30%. |