Popis: |
An audio signal is an analogue signal represented as a one-dimensional function x(t), where t is a continuous variable denoting time. Such signals, generated from diverse sources, can be perceived as music, speech, noise or any combination of these. For machines to process audio, these signals must be represented through the extraction of features that describe the composition of the audio signal and its behavior over time. Audio feature extraction can enhance the efficacy of audio processing and is therefore beneficial for numerous applications. We present an emotion classification analysis of audio representations (1-dimensional and 2-dimensional), focusing on the audio recordings available in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset; classification is based on eight (8) different emotions. We examine classification accuracy, averaged over five (5) iterations, for each audio signal representation (raw audio, normalized raw audio and spectrogram). The features extracted in 1D and 2D are used as input to a Convolutional Neural Network (CNN). A single-factor analysis of variance (ANOVA) was performed on the obtained accuracy values to test whether the different audio signal representations of the dataset differ significantly. The obtained F-ratio is greater than the critical F-ratio, so the value lies in the critical region. This provides evidence that, at the 0.05 significance level, the true means across the different representations of the dataset do differ.
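As an illustration of the pipeline summarized above, the following Python sketch shows how the three signal representations (raw audio, normalized raw audio and a log-magnitude spectrogram) could be derived from a single recording, and how a single-factor ANOVA can be applied to per-representation accuracy values. The librosa and scipy calls, the file name and all numeric values are assumptions for illustration only and are not taken from the study.

    # Sketch only: assumes librosa, numpy and scipy are installed.
    import numpy as np
    import librosa
    from scipy import stats

    # Hypothetical RAVDESS file name; the study uses the full dataset.
    path = "03-01-05-01-02-01-12.wav"

    # 1D representations: raw waveform and amplitude-normalized waveform.
    y, sr = librosa.load(path, sr=None)      # raw audio samples
    y_norm = y / np.max(np.abs(y))           # normalized raw audio in [-1, 1]

    # 2D representation: log-magnitude spectrogram of the signal.
    S = np.abs(librosa.stft(y))              # short-time Fourier transform magnitudes
    S_db = librosa.amplitude_to_db(S, ref=np.max)

    # Single-factor ANOVA on accuracy values from five iterations per representation.
    # The numbers below are placeholders, not the accuracies reported in the study.
    acc_raw         = [0.40, 0.42, 0.41, 0.39, 0.43]
    acc_normalized  = [0.45, 0.44, 0.46, 0.43, 0.45]
    acc_spectrogram = [0.60, 0.62, 0.61, 0.59, 0.63]

    f_ratio, p_value = stats.f_oneway(acc_raw, acc_normalized, acc_spectrogram)
    print(f"F = {f_ratio:.3f}, p = {p_value:.4f}")
    # If p < 0.05 (equivalently, F exceeds the critical F-ratio), the mean accuracies
    # of the three representations are judged to differ significantly.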