Showing 1 - 10 of 20 for search: '"Puvvada, Krishna C."'
Author:
Peng, Yifan, Puvvada, Krishna C., Chen, Zhehuai, Żelasko, Piotr, Huang, He, Dhawan, Kunal, Hu, Ke, Watanabe, Shinji, Balam, Jagadeesh, Ginsburg, Boris
Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input…
External link:
http://arxiv.org/abs/2410.17485
Author:
Park, Taejin, Medennikov, Ivan, Dhawan, Kunal, Wang, Weiqing, Huang, He, Koluguri, Nithin Rao, Puvvada, Krishna C., Balam, Jagadeesh, Ginsburg, Boris
We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge…
External link:
http://arxiv.org/abs/2409.06656
Author:
Wang, Weiqing, Dhawan, Kunal, Park, Taejin, Puvvada, Krishna C., Medennikov, Ivan, Majumdar, Somshubra, Huang, He, Balam, Jagadeesh, Ginsburg, Boris
Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data…
External link:
http://arxiv.org/abs/2409.01438
Author:
Huang, He, Park, Taejin, Dhawan, Kunal, Medennikov, Ivan, Puvvada, Krishna C., Koluguri, Nithin Rao, Wang, Weiqing, Balam, Jagadeesh, Ginsburg, Boris
Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this…
External link:
http://arxiv.org/abs/2408.13106
Author:
Chen, Zhehuai, Huang, He, Hrinchuk, Oleksii, Puvvada, Krishna C., Koluguri, Nithin Rao, Żelasko, Piotr, Balam, Jagadeesh, Ginsburg, Boris
Incorporating speech understanding capabilities into pretrained large language models (SpeechLLM) has become a vital research direction. Previous architectures can be categorized as: i) GPT-style, which prepends speech prompts to the text prompts as a…
External link:
http://arxiv.org/abs/2406.19954
Author:
Puvvada, Krishna C., Żelasko, Piotr, Huang, He, Hrinchuk, Oleksii, Koluguri, Nithin Rao, Dhawan, Kunal, Majumdar, Somshubra, Rastorgueva, Elena, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation…
External link:
http://arxiv.org/abs/2406.19674
Humans are adept at leveraging visual cues from lip movements to recognize speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions…
External link:
http://arxiv.org/abs/2405.12983
Author:
Park, Tae Jin, Huang, He, Jukic, Ante, Dhawan, Kunal, Puvvada, Krishna C., Koluguri, Nithin, Karpov, Nikolay, Laptev, Aleksandr, Balam, Jagadeesh, Ginsburg, Boris
Published in:
CHiME-7 Workshop 2023
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to…
External link:
http://arxiv.org/abs/2310.12378
Author:
Chen, Zhehuai, Huang, He, Andrusenko, Andrei, Hrinchuk, Oleksii, Puvvada, Krishna C., Li, Jason, Ghosh, Subhankar, Balam, Jagadeesh, Ginsburg, Boris
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and…
External link:
http://arxiv.org/abs/2310.09424
Discrete audio representation, also known as audio tokenization, has seen renewed interest, driven by its potential to facilitate the application of text language modeling approaches in the audio domain. To this end, various compression and representation-learning…
External link:
http://arxiv.org/abs/2309.10922