Showing 1 - 10 of 27 for search: '"Li, Jinyu"'
Author:
Wu, Jian, Gaur, Yashesh, Chen, Zhuo, Zhou, Long, Zhu, Yimeng, Wang, Tianrui, Li, Jinyu, Liu, Shujie, Ren, Bo, Liu, Linquan, Wu, Yu
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been e
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::99eb9d516394e3811b6f509b7c7fc3d0
http://arxiv.org/abs/2307.03917
Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate. However, this design is inefficient, particularly for long speech signals due to the quadra
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::dc3855bab09f4bc7b8c3107b081d3b7f
http://arxiv.org/abs/2306.16009
The integration of Language Models (LMs) has proven to be an effective way to address domain shifts in speech recognition. However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different fr
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::38090c0a55d370bec08810bda39f5985
http://arxiv.org/abs/2306.16007
Author:
Yang, Muqiao, Kanda, Naoyuki, Wang, Xiaofei, Wu, Jian, Sivasankaran, Sunit, Chen, Zhuo, Li, Jinyu, Yoshioka, Takuya
Published in:
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human t
Published in:
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
We previously proposed contextual spelling correction (CSC) to correct the output of end-to-end (E2E) automatic speech recognition (ASR) models with contextual information such as name, place, etc. Although CSC has achieved reasonable improvement in
Author:
Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, Wei, Furu
Published in:
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of
Author:
Wang, Tianrui, Zhou, Long, Zhang, Ziqiang, Wu, Yu, Liu, Shujie, Gaur, Yashesh, Chen, Zhuo, Li, Jinyu, Wei, Furu
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that u
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::071e31d93750cf76e2c5f8f576a56722
http://arxiv.org/abs/2305.16107
Author:
Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a condit
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d7122f1b2f27efd14eece69f9f82e478
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transforme
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::97f88fbd214361d264329b545e34862f
Author:
Zhang, Ziqiang, Zhou, Long, Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target lang
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::4332efdacb44ce5ea14a0c5016fb69e1