Showing 1 - 10 of 970 for search: '"kumar, Anurag"'
Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger).
External link:
http://arxiv.org/abs/2410.24151
Objective speech quality measures are typically used to assess speech enhancement algorithms, but it has been shown that they are sub-optimal as learning objectives because they do not always align well with human subjective ratings. This misalignmen
External link:
http://arxiv.org/abs/2410.13182
In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on th
External link:
http://arxiv.org/abs/2410.07463
Direction-of-arrival estimation of multiple speakers in a room is an important task for a wide range of applications. In particular, challenging environments with moving speakers, reverberation, and noise lead to significant performance degradation f
External link:
http://arxiv.org/abs/2409.14346
Smart glasses are emerging as a popular wearable computing platform potentially revolutionizing the next generation of human-computer interaction. The widespread adoption of smart glasses has created a pressing need for discreet and hands-free contro
External link:
http://arxiv.org/abs/2408.11346
Author:
Kumar, Anurag, Sundaresan, Rajesh
One of the requirements of network slicing in 5G networks is RAN (radio access network) scheduling with rate guarantees. We study a three-time-scale algorithm for maximum sum utility scheduling, with minimum rate constraints. As usual, the scheduler
External link:
http://arxiv.org/abs/2408.09182
Author:
Yun, Heeseung, Gao, Ruohan, Ananthabhotla, Ishwarya, Kumar, Anurag, Donley, Jacob, Li, Chao, Kim, Gunhee, Ithapu, Vamsi Krishna, Murdock, Calvin
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which
External link:
http://arxiv.org/abs/2408.05364
Author:
Lan, Gael Le, Shi, Bowen, Ni, Zhaoheng, Srinivasan, Sidd, Kumar, Anurag, Ellis, Brian, Kant, David, Nagaraja, Varun, Chang, Ernie, Hsu, Wei-Ning, Shi, Yangyang, Chandra, Vikas
We introduce MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model. It operates on continuous latent representations from a low frame rate 48 kHz stereo variational auto encoder codec. Based on a diffusion transf
External link:
http://arxiv.org/abs/2407.03648
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet
External link:
http://arxiv.org/abs/2406.11619
Author:
Zhang, Wangyou, Scheibler, Robin, Saijo, Kohei, Cornell, Samuele, Li, Chenda, Ni, Zhaoheng, Kumar, Anurag, Pirklbauer, Jan, Sach, Marvin, Watanabe, Shinji, Fingscheidt, Tim, Qian, Yanmin
The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this
External link:
http://arxiv.org/abs/2406.04660