Description: |
This article introduces a versatile multimodal architecture designed for personality-aware systems, encompassing tasks such as personality trait prediction, sentiment analysis, and emotion recognition. It is a distinctive attempt to develop a general pipeline applicable to personality-related affective computing applications on multimodal data. The proposed model employs task-specific feature extraction models that are trained separately for each application. An intermediate layer that fuses modalities through both inter- and intra-attention mechanisms is presented. This dual-attention mechanism is further refined with a binary search algorithm, which constitutes the key contribution of the work. The fusion model discerns the distinctive features crucial for classification and regression tasks. To evaluate the system's efficacy, short-duration video clips and their corresponding transcriptions from existing databases were used. Low-level acoustic features were derived from the audio signals, while mid- and high-level features were extracted from the audio transcripts with a transformer-based Sentence-RoBERTa model. Visual features were obtained from context and facial images through deep face networks, followed by CNN and LSTM models. Dimensionality reduction and multimodal fusion techniques were applied before the machine learning-based classification and prediction stages. Mean accuracy and the squared correlation coefficient ($R^{2}$) were chosen as performance metrics for the prediction tasks, while accuracy and F1-score were used for the classification tasks. The study explored various fusion techniques and dimensionality-reduction approaches to establish an efficient pipeline, ultimately aiming to reduce uncertainty and enhance robustness. The results indicate that the proposed architecture performs comparably to state-of-the-art systems across all evaluated domains.
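
The dual-attention fusion and the binary search mentioned above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's method: the abstract does not specify what the binary search optimizes, so this sketch assumes it tunes a scalar mixing weight `alpha` between intra- and inter-attention features; the `attention`, `fuse`, and `binary_search_alpha` helpers and the toy validation loss are all hypothetical.

```python
# Hypothetical sketch of inter-/intra-attention fusion with a binary
# search over a scalar mixing weight; not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    """Scaled dot-product attention, mean-pooled over time steps."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return (w @ v).mean(axis=0)          # (d,) pooled representation

def fuse(audio, text, alpha):
    """Blend intra-attention (within each modality) with
    inter-attention (across modalities) into one fused vector."""
    intra = attention(audio, audio, audio) + attention(text, text, text)
    inter = attention(audio, text, text) + attention(text, audio, audio)
    return alpha * intra + (1.0 - alpha) * inter

def binary_search_alpha(loss_fn, lo=0.0, hi=1.0, iters=30, eps=1e-4):
    """Bisect on the sign of a finite-difference slope to find the
    mixing weight; assumes loss_fn is unimodal on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if loss_fn(mid + eps) > loss_fn(mid - eps):
            hi = mid      # positive slope: the minimum lies to the left
        else:
            lo = mid
    return (lo + hi) / 2.0

# Toy usage with random per-frame audio features and per-token text
# embeddings; `target` stands in for a validation signal.
audio = rng.standard_normal((40, 64))    # 40 audio frames, 64-dim features
text = rng.standard_normal((12, 64))     # 12 tokens, 64-dim embeddings
target = rng.standard_normal(64)

loss = lambda a: np.linalg.norm(fuse(audio, text, a) - target)
best_alpha = binary_search_alpha(loss)
print(f"alpha={best_alpha:.3f}, fused dim={fuse(audio, text, best_alpha).shape[0]}")
```

Note that bisecting on the slope sign converges only if the validation loss is unimodal in the weight; if that assumption fails, a grid or ternary search would be a safer choice.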