Showing 1 - 5 of 5 for search: '"Nadeem, Asmar"'
We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through
External link:
http://arxiv.org/abs/2411.05603
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the o
External link:
http://arxiv.org/abs/2408.14441
Author:
Nadeem, Asmar; Sardari, Faegheh; Dawes, Robert; Husain, Syed Sameed; Hilton, Adrian; Mustafa, Armin
Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative r
External link:
http://arxiv.org/abs/2406.06499
In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) inf
External link:
http://arxiv.org/abs/2310.16754
Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from t
External link:
http://arxiv.org/abs/2303.14829