MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Authors:	Shaikh, Muhammad Bilal; Chai, Douglas; Islam, Syed Mohammed Shamsul; Akhtar, Naveed
Publication year:	2023
Subject:	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Computer Science - Multimedia
Document type:	Working Paper
Description:	Mirroring the human ability to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities such as vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). The model combines audio-image and video modalities in an intuitive way, with the primary aim of improving multimodal human action recognition (MHAR). At its core, MAiVAR-T distills rich representations from the audio modality and transforms them into the image domain. This audio-image representation is then fused with the video modality to form a unified representation, which exploits the contextual richness of both audio and video to improve action recognition. MAiVAR-T outperforms existing state-of-the-art approaches that rely solely on audio or video modalities. Extensive empirical evaluation on a benchmark action recognition dataset confirms this, underscoring the gains available from integrating audio and video modalities for action recognition. Comment: 6 pages, 7 figures, 4 tables. Peer reviewed; accepted at the 11th European Workshop on Visual Information Processing (EUVIP), to be held 11-14 September 2023 in Gjøvik, Norway. arXiv admin note: text overlap with arXiv:2103.15691 by other authors.
Database:	arXiv
External link:	http://arxiv.org/abs/2308.03741
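
The abstract in this record describes a two-step idea: turn audio into an image-like representation, then fuse it with video features. The sketch below illustrates that idea only in outline and is not the authors' implementation. Everything specific in it is an assumption: the log-magnitude spectrogram as the audio image, the random linear projection `embed` standing in for a transformer encoder, concatenation as the fusion step, and all function names, shapes, and parameters.

```python
import numpy as np

def audio_to_image(wave, frame_len=256, hop=128):
    """Map a 1-D waveform to a 2-D log-magnitude spectrogram.

    Assumption: the record does not say which audio-image
    representation the paper uses; a spectrogram is one common choice.
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)
    return np.log1p(spec).T                     # (freq_bins, n_frames)

def embed(x, dim, rng):
    """Hypothetical stand-in for a transformer encoder.

    Flattens the input and applies a random linear projection, so the
    output is a fixed-size feature vector with no learned content.
    """
    w = rng.standard_normal((x.size, dim)) / np.sqrt(x.size)
    return x.reshape(-1) @ w

def fuse(audio_feat, video_feat):
    """Fuse two feature vectors by concatenation (assumed fusion step)."""
    return np.concatenate([audio_feat, video_feat])

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)            # synthetic 1 s of audio at 16 kHz
video = rng.standard_normal((8, 32, 32, 3))  # synthetic clip: 8 RGB frames of 32x32

audio_image = audio_to_image(wave)
fused = fuse(embed(audio_image, 64, rng), embed(video, 64, rng))
print(audio_image.shape, fused.shape)
```

In a real system the fused vector would feed a classification head trained on action labels; the synthetic inputs here only demonstrate the data flow and tensor shapes.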