Showing 1 - 10 of 480 results for the search: '"Serra, Joan"'
Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization …
External link:
http://arxiv.org/abs/2407.10387
Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of audio and visual modalities in extensive web-scale video datasets to achieve significant advancements. However, …
External link:
http://arxiv.org/abs/2407.05782
Universal source separation aims at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing …
External link:
http://arxiv.org/abs/2310.00140
Author:
Serrà, Joan, Scaini, Davide, Pascual, Santiago, Arteaga, Daniel, Pons, Jordi, Breebaart, Jeroen, Cengarle, Giulio
Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain a realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo …
External link:
http://arxiv.org/abs/2306.14647
Author:
Dong, Hao-Wen, Liu, Xiaoyu, Pons, Jordi, Bhattacharya, Gautam, Pascual, Santiago, Serrà, Joan, Berg-Kirkpatrick, Taylor, McAuley, Julian
Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled …
External link:
http://arxiv.org/abs/2306.09635
Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of …
External link:
http://arxiv.org/abs/2210.14661
Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment …
External link:
http://arxiv.org/abs/2210.12635
Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source-agnostic models that do so. In this work, we complement PIT with adversarial losses but …
External link:
http://arxiv.org/abs/2210.12108