Showing 1 - 10 of 66 for search: '"Nagaraja, Varun"'
Author:
Lan, Gael Le, Shi, Bowen, Ni, Zhaoheng, Srinivasan, Sidd, Kumar, Anurag, Ellis, Brian, Kant, David, Nagaraja, Varun, Chang, Ernie, Hsu, Wei-Ning, Shi, Yangyang, Chandra, Vikas
We introduce MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model. It operates on continuous latent representations from a low frame rate 48 kHz stereo variational autoencoder codec. Based on a diffusion transf
External link:
http://arxiv.org/abs/2407.03648
Author:
Chang, Ernie, Srinivasan, Sidd, Luthra, Mahi, Lin, Pin-Jie, Nagaraja, Varun, Iandola, Forrest, Liu, Zechun, Ni, Zhaoheng, Zhao, Changsheng, Shi, Yangyang, Chandra, Vikas
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compare
External link:
http://arxiv.org/abs/2311.00897
Author:
Mei, Xinhao, Nagaraja, Varun, Lan, Gael Le, Ni, Zhaoheng, Chang, Ernie, Shi, Yangyang, Chandra, Vikas
Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intric
External link:
http://arxiv.org/abs/2309.10537
Author:
Lan, Gael Le, Nagaraja, Varun, Chang, Ernie, Kant, David, Ni, Zhaoheng, Shi, Yangyang, Iandola, Forrest, Chandra, Vikas
In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, fla
External link:
http://arxiv.org/abs/2309.08804
Author:
Shi, Yangyang, Lan, Gael Le, Nagaraja, Varun, Ni, Zhaoheng, Mei, Xinhao, Chang, Ernie, Iandola, Forrest, Liu, Yang, Chandra, Vikas
This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverage
External link:
http://arxiv.org/abs/2309.08773
Author:
Shi, Yangyang, Wu, Chunyang, Wang, Dilin, Xiao, Alex, Mahadeokar, Jay, Zhang, Xiaohui, Liu, Chunxi, Li, Ke, Shangguan, Yuan, Nagaraja, Varun, Kalinli, Ozlem, Seltzer, Mike
This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply the causal convolution to improve streaming transformer ignoring the lookahead context. We propose to use non-causal con
External link:
http://arxiv.org/abs/2110.05241
Author:
Nagaraja, Varun, Shi, Yangyang, Venkatesh, Ganesh, Kalinli, Ozlem, Seltzer, Michael L., Chandra, Vikas
On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge sh
External link:
http://arxiv.org/abs/2106.08960
Author:
Shi, Yangyang, Nagaraja, Varun, Wu, Chunyang, Mahadeokar, Jay, Le, Duc, Prabhavalkar, Rohit, Xiao, Alex, Yeh, Ching-Feng, Chan, Julian, Fuegen, Christian, Kalinli, Ozlem, Seltzer, Michael L.
We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trade off accuracy and latency, DET assigns differen
External link:
http://arxiv.org/abs/2104.02176
Referring expressions usually describe an object using properties of the object and relationships of the object with other objects. We propose a technique that integrates context between objects to understand referring expressions. Our approach uses
External link:
http://arxiv.org/abs/1608.00525
To identify the location of objects of a particular class, a passive computer vision system generally processes all the regions in an image to finally output few regions. However, we can use structure in the scene to search for objects without proces
External link:
http://arxiv.org/abs/1511.07710