MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Author: Shaikh, Muhammad Bilal, Chai, Douglas, Islam, Syed Mohammed Shamsul, Akhtar, Naveed
Year of publication: 2023
Subject:
Document type: Working Paper
Description: In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast to existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance. This underscores the potential enhancements derived from integrating audio and video modalities for action recognition purposes.
Comment: 6 pages, 7 figures, 4 tables. Peer reviewed; accepted at the 11th European Workshop on Visual Information Processing (EUVIP), held 11th-14th September 2023 in Gjøvik, Norway. arXiv admin note: text overlap with arXiv:2103.15691 by other authors
Database: arXiv