Zobrazeno 1 - 10
of 214
pro vyhledávání: '"Du Zhihao"'
Autor:
Du, Zhihao, Wang, Yuxuan, Chen, Qian, Shi, Xian, Lv, Xiang, Zhao, Tianyu, Gao, Zhifu, Yang, Yexin, Gao, Changfeng, Wang, Hui, Yu, Fan, Liu, Huadai, Sheng, Zhengyan, Gu, Yue, Deng, Chong, Wang, Wen, Zhang, Shiliang, Yan, Zhijie, Zhou, Jingren
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, C
Externí odkaz:
http://arxiv.org/abs/2412.10117
While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, d
Externí odkaz:
http://arxiv.org/abs/2410.16726
Autor:
Zhang, Xin, Lyu, Xiang, Du, Zhihao, Chen, Qian, Zhang, Dong, Hu, Hangrui, Tan, Chaohong, Zhao, Tianyu, Wang, Yuxuan, Zhang, Bin, Lu, Heng, Zhou, Yaqian, Qiu, Xipeng
Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead an
Externí odkaz:
http://arxiv.org/abs/2410.08035
Autor:
Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, Zheng, Siqi, Gu, Yue, Ma, Ziyang, Gao, Zhifu, Yan, Zhijie
Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, wh
Externí odkaz:
http://arxiv.org/abs/2407.05407
Autor:
An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, Zheng, Siqi
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emoti
Externí odkaz:
http://arxiv.org/abs/2407.04051
Autor:
Ma, Ziyang, Yang, Guanrou, Yang, Yifan, Gao, Zhifu, Wang, Jiaming, Du, Zhihao, Yu, Fan, Chen, Qian, Zheng, Siqi, Zhang, Shiliang, Chen, Xie
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as
Externí odkaz:
http://arxiv.org/abs/2402.08846
Autor:
Li, Yangze, Yu, Fan, Liang, Yuhao, Guo, Pengcheng, Shi, Mohan, Du, Zhihao, Zhang, Shiliang, Xie, Lei
Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are ba
Externí odkaz:
http://arxiv.org/abs/2310.04863
Autor:
Du, Zhihao, Wang, Jiaming, Chen, Qian, Chu, Yunfei, Gao, Zhifu, Li, Zerui, Hu, Kai, Zhou, Xiaohuan, Xu, Jin, Ma, Ziyang, Wang, Wen, Zheng, Siqi, Zhou, Chang, Yan, Zhijie, Zhang, Shiliang
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-
Externí odkaz:
http://arxiv.org/abs/2310.04673
Autor:
Liang, Yuhao, Shi, Mohan, Yu, Fan, Li, Yangze, Zhang, Shiliang, Du, Zhihao, Chen, Qian, Xie, Lei, Qian, Yanmin, Wu, Jian, Chen, Zhuo, Lee, Kong Aik, Yan, Zhijie, Bu, Hui
With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of \emph{speaker-attributed ASR (SA-ASR)}, which dir
Externí odkaz:
http://arxiv.org/abs/2309.13573
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech cod
Externí odkaz:
http://arxiv.org/abs/2309.07405