Showing 41 - 50 of 2,053 for search: '"Li, Jinyu"'
Traditional automatic speech recognition~(ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending longer transcription …
External link:
http://arxiv.org/abs/2211.09412
Author:
Huang, Zili, Chen, Zhuo, Kanda, Naoyuki, Wu, Jian, Wang, Yiming, Li, Jinyu, Yoshioka, Takuya, Wang, Xiaofei, Wang, Peidong
Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker …
External link:
http://arxiv.org/abs/2211.05564
Author:
Chen, Zhuo, Kanda, Naoyuki, Wu, Jian, Wu, Yu, Wang, Xiaofei, Yoshioka, Takuya, Li, Jinyu, Sivasankaran, Sunit, Eskimez, Sefik Emre
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the …
External link:
http://arxiv.org/abs/2211.05172
Author:
Gaur, Yashesh, Kibre, Nick, Xue, Jian, Shu, Kangyuan, Wang, Yuhui, Alphanso, Issac, Li, Jinyu, Gong, Yifan
Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State …
External link:
http://arxiv.org/abs/2211.03721
Author:
Wang, Peidong, Sun, Eric, Xue, Jian, Wu, Yu, Zhou, Long, Gaur, Yashesh, Liu, Shujie, Li, Jinyu
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST model …
External link:
http://arxiv.org/abs/2211.02809
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language. The backbone of SM2 is Transformer Transducer, which has high …
External link:
http://arxiv.org/abs/2211.02499
Author:
Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, Wei, Furu
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of …
External link:
http://arxiv.org/abs/2210.17027
Author:
Yang, Muqiao, Kanda, Naoyuki, Wang, Xiaofei, Wu, Jian, Sivasankaran, Sunit, Chen, Zhuo, Li, Jinyu, Yoshioka, Takuya
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human …
External link:
http://arxiv.org/abs/2210.15715
Masked language model (MLM) has been widely used for understanding tasks, e.g., BERT. Recently, MLM has also been used for generation tasks. The most popular one in speech is using Mask-CTC for non-autoregressive speech recognition. In this paper, we …
External link:
http://arxiv.org/abs/2210.08665
In this work, we present a simple but effective method, CTCBERT, for advancing hidden-unit BERT (HuBERT). HuBERT applies a frame-level cross-entropy (CE) loss, which is similar to most acoustic model training. However, CTCBERT performs the model …
External link:
http://arxiv.org/abs/2210.08603