Showing 1 - 10 of 1,392
for search: '"Jackson, Philip"'
Author:
Berghi, Davide, Jackson, Philip J. B.
This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which …
External link:
http://arxiv.org/abs/2410.22271
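The snippet above is cut off at the model description. As a rough illustration of what an audio-visual Conformer pipeline can look like, here is a minimal PyTorch sketch that fuses frame-aligned audio and visual features by concatenation and feeds them to torchaudio's Conformer encoder; the feature sizes, the concatenation-based fusion, and the output head are illustrative assumptions, not the authors' exact system.

```python
# Minimal sketch of late audio-visual fusion into a Conformer encoder.
# Dimensions and the fusion strategy are assumptions for illustration.
import torch
import torchaudio

AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 256, 256, 13  # assumed sizes

encoder = torchaudio.models.Conformer(
    input_dim=AUDIO_DIM + VISUAL_DIM,  # fused feature size
    num_heads=8,
    ffn_dim=1024,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)
# e.g. one activity score plus a 3-D direction per class, per frame
head = torch.nn.Linear(AUDIO_DIM + VISUAL_DIM, NUM_CLASSES * 4)

audio = torch.randn(2, 100, AUDIO_DIM)    # (batch, frames, features)
visual = torch.randn(2, 100, VISUAL_DIM)  # frame-aligned visual embeddings
lengths = torch.full((2,), 100)

fused, _ = encoder(torch.cat([audio, visual], dim=-1), lengths)
predictions = head(fused)  # per-frame SELD-style outputs
```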
Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, in a dense multi-label scenario actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies …
External link:
http://arxiv.org/abs/2406.06187
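Where the snippet contrasts sparse single-action detection with dense multi-label detection, the practical consequence is that each class needs an independent probability at every frame, rather than a softmax that forces one winner. A minimal sketch, assuming per-frame logits from some backbone:

```python
# Toy illustration: per-class sigmoids let several actions be active at the
# same timestamp, unlike softmax. Class count and logits are placeholders.
import torch

NUM_CLASSES = 5
frame_logits = torch.randn(1, 10, NUM_CLASSES)  # (batch, frames, classes)

probs = torch.sigmoid(frame_logits)  # each class scored independently
active = probs > 0.5                 # several classes may fire per frame
targets = torch.randint(0, 2, (1, 10, NUM_CLASSES)).float()

# Binary cross-entropy per class and frame, the standard dense multi-label loss.
loss = torch.nn.functional.binary_cross_entropy_with_logits(frame_logits, targets)
```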
Author:
Berghi, Davide, Jackson, Philip J. B.
Object-based audio production requires positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed …
External link:
http://arxiv.org/abs/2406.00495
Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, …
External link:
http://arxiv.org/abs/2405.10690
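Training with only video-level labels typically means pooling frame-level scores into a video-level prediction that the weak label can supervise. Below is a minimal sketch of attention-based multiple-instance pooling, one common choice for this setup rather than this paper's specific mechanism:

```python
# Multiple-instance learning (MIL) pooling sketch: frame-level event scores
# are aggregated over time so a video-level label can provide supervision.
import torch

frame_logits = torch.randn(2, 50, 25)  # (batch, frames, event classes)
# stand-in for learned temporal attention weights, normalized over frames
attn = torch.softmax(torch.randn(2, 50, 1), dim=1)

video_logits = (attn * frame_logits).sum(dim=1)  # weighted pooling over time
video_labels = torch.randint(0, 2, (2, 25)).float()
loss = torch.nn.functional.binary_cross_entropy_with_logits(video_logits, video_labels)
```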
Author:
Berghi, Davide, Jackson, Philip J. B.
Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face o…
External link:
http://arxiv.org/abs/2312.14021
Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has recently been included. Few audio…
External link:
http://arxiv.org/abs/2312.09034
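The two SELD subtasks described above are often realized as parallel output branches on a shared encoder: per-frame class activity for SED and a Cartesian direction vector per class for DOA. A minimal sketch, with sizes chosen arbitrarily:

```python
# Two-branch SELD output head: sigmoid activity per class (SED) and a
# tanh-bounded (x, y, z) direction vector per class (DOA).
import torch

NUM_CLASSES, FEAT = 13, 512

class SELDHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sed = torch.nn.Linear(FEAT, NUM_CLASSES)      # event activity logits
        self.doa = torch.nn.Linear(FEAT, NUM_CLASSES * 3)  # (x, y, z) per class

    def forward(self, x):  # x: (batch, frames, FEAT)
        activity = torch.sigmoid(self.sed(x))
        direction = torch.tanh(self.doa(x)).view(*x.shape[:2], NUM_CLASSES, 3)
        return activity, direction

activity, direction = SELDHead()(torch.randn(2, 100, FEAT))
```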
Author:
Zhao, Jinzheng, Xu, Yong, Qian, Xinyuan, Berghi, Davide, Wu, Peipei, Cui, Meng, Sun, Jianyuan, Jackson, Philip J. B., Wang, Wenwu
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual …
External link:
http://arxiv.org/abs/2310.14778
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal po…
External link:
http://arxiv.org/abs/2308.05051
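Multi-scale temporal features of the kind the snippet mentions are commonly built by pooling the same frame sequence at several temporal resolutions and fusing the results, so that both short and long action dependencies stay visible. A generic sketch of such a temporal pyramid, not PAT's exact design:

```python
# Generic temporal pyramid: pool features at several strides, upsample back
# to the original frame rate, and fuse. Sizes are placeholders.
import torch
import torch.nn.functional as F

x = torch.randn(2, 512, 100)  # (batch, channels, frames)

scales = []
for stride in (1, 2, 4):  # three temporal resolutions
    pooled = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    # restore the original frame rate before fusion
    scales.append(F.interpolate(pooled, size=x.shape[-1],
                                mode="linear", align_corners=False))

fused = torch.stack(scales).sum(dim=0)  # (batch, channels, frames)
```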