Showing 1 - 10 of 708 for search: '"Meng, Zhong"'
In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better A
External link:
http://arxiv.org/abs/2406.14701
Author:
Meng, Zhong, Wu, Zelin, Prabhavalkar, Rohit, Peyser, Cal, Wang, Weiran, Chen, Nanxin, Sainath, Tara N., Ramabhadran, Bhuvana
Published in:
Interspeech 2024, Kos Island, Greece
Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhan
External link:
http://arxiv.org/abs/2406.02921
Gradient clipping plays a vital role in training large-scale automatic speech recognition (ASR) models. It is typically applied to minibatch gradients to prevent gradient explosion, and to the individual sample gradients to mitigate unintended memori
External link:
http://arxiv.org/abs/2406.02004
Author:
Wu, Zelin, Song, Gan, Li, Christopher, Rondon, Pat, Meng, Zhong, Velez, Xavier, Wang, Weiran, Caseiro, Diamantino, Pundak, Golan, Munkhdalai, Tsendsuren, Chandorkar, Angad, Prabhavalkar, Rohit
Published in:
2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics - Industry Track
Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for
External link:
http://arxiv.org/abs/2404.10180
Author:
Prabhavalkar, Rohit, Meng, Zhong, Wang, Weiran, Stooke, Adam, Cai, Xingyu, He, Yanzhang, Narayanan, Arun, Hwang, Dongseong, Sainath, Tara N., Moreno, Pedro J.
The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires c
External link:
http://arxiv.org/abs/2402.17184
Author:
Wang, Mingqiu, Han, Wei, Shafran, Izhak, Wu, Zelin, Chiu, Chung-Cheng, Cao, Yuan, Wang, Yongqiang, Chen, Nanxin, Zhang, Yu, Soltau, Hagen, Rubenstein, Paul, Zilka, Lukas, Yu, Dian, Meng, Zhong, Pundak, Golan, Siddhartha, Nikhil, Schalkwyk, Johan, Wu, Yonghui
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their
External link:
http://arxiv.org/abs/2310.00230
Author:
Wang, Weiran, Wu, Zelin, Caseiro, Diamantino, Munkhdalai, Tsendsuren, Sim, Khe Chai, Rondon, Pat, Pundak, Golan, Song, Gan, Prabhavalkar, Rohit, Meng, Zhong, Zhao, Ding, Sainath, Tara, Mengibar, Pedro Moreno
Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios. We propose algorithms for contextual biasing based on the Knuth-
External link:
http://arxiv.org/abs/2310.00178
Author:
Wang, Weiran, Prabhavalkar, Rohit, Hwang, Dongseong, Li, Qiujia, Sim, Khe Chai, Li, Bo, Qin, James, Cai, Xingyu, Stooke, Adam, Meng, Zhong, Zheng, CJ, He, Yanzhang, Sainath, Tara, Mengibar, Pedro Moreno
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model paramet
External link:
http://arxiv.org/abs/2309.12963
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space seq
External link:
http://arxiv.org/abs/2309.08551
Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tas
External link:
http://arxiv.org/abs/2308.07395