Audio-Visual Action Recognition Using Transformer Fusion Network

Author:	Jun-Hwa Kim, Chee Sun Won
Language:	English
Year of publication:	2024
Subject:	action recognition; multi modal; deep learning; video; Technology; Engineering (General). Civil engineering (General); TA1-2040; Biology (General); QH301-705.5; Physics; QC1-999; Chemistry; QD1-999
Source:	Applied Sciences, Vol 14, Iss 3, p 1190 (2024)
Document type:	article
ISSN:	2076-3417
DOI:	10.3390/app14031190
Description:	Our approach to action recognition is grounded in the intrinsic coexistence of and complementary relationship between audio and visual information in videos. Going beyond the traditional emphasis on visual features, we propose a transformer-based network that integrates both audio and visual data as inputs. This network is designed to accept and process spatial, temporal, and audio modalities. Features from each modality are extracted using a single Swin Transformer, originally devised for still images. Subsequently, these extracted features from spatial, temporal, and audio data are combined using a novel modal fusion module (MFM). Our transformer-based network effectively fuses these three modalities, resulting in a robust solution for action recognition.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/b553acd97b924a7397508204bf976e2e
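
The data flow outlined in the Description (three modalities, one shared feature extractor, then a fusion step) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record does not say how the MFM works internally or how each modality is encoded as input, so the shared backbone here is a random linear projection standing in for the Swin Transformer, the fusion is a generic softmax-weighted sum standing in for the MFM, and all names and sizes (`shared_backbone`, `fuse`, `EMBED_DIM`, `NUM_CLASSES`, `INPUT_SIZE`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8     # hypothetical feature size
NUM_CLASSES = 5   # hypothetical number of action classes
INPUT_SIZE = 16   # hypothetical side length of each image-like input

# Stand-in for the single shared backbone. The record names a Swin
# Transformer; a fixed random projection keeps this sketch lightweight.
W_backbone = rng.normal(size=(INPUT_SIZE * INPUT_SIZE, EMBED_DIM)) / INPUT_SIZE


def shared_backbone(image_like):
    """Map one (H, W) image-like input to a feature vector.

    The same weights are reused for every modality, mirroring the idea of
    a single extractor shared across spatial, temporal, and audio inputs.
    """
    return np.tanh(image_like.reshape(-1) @ W_backbone)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


# Assumed fusion parameters (NOT the paper's MFM): one scoring vector that
# rates each modality's feature, and a linear classifier on the fused result.
w_score = rng.normal(size=EMBED_DIM)
W_cls = rng.normal(size=(EMBED_DIM, NUM_CLASSES))


def fuse(features):
    """Combine (3, EMBED_DIM) modality features into one vector.

    Each modality gets a scalar score; a softmax turns the scores into
    weights, and the fused feature is the weighted sum.
    """
    weights = softmax(features @ w_score)
    return weights @ features, weights


def classify(spatial, temporal, audio):
    """Run all three inputs through the shared backbone, fuse, and classify."""
    feats = np.stack([shared_backbone(x) for x in (spatial, temporal, audio)])
    fused, weights = fuse(feats)
    return softmax(fused @ W_cls), weights


# Random placeholders for the three inputs. Treating each modality as a 2-D
# image-like array is an assumption made so one image backbone can be reused.
spatial = rng.random((INPUT_SIZE, INPUT_SIZE))
temporal = rng.random((INPUT_SIZE, INPUT_SIZE))
audio = rng.random((INPUT_SIZE, INPUT_SIZE))

probs, weights = classify(spatial, temporal, audio)
print("class probabilities:", np.round(probs, 3))
print("modality weights (spatial, temporal, audio):", np.round(weights, 3))
```

The point of the sketch is the structure: `shared_backbone` is called three times with the same weights, and the classifier only ever sees the fused vector. In a real system the projection would be replaced by a pretrained image transformer and the weighted sum by a learned fusion module.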