Showing 1 - 10 of 1,893 for search: '"Chen, Xie"'
Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE
External link:
http://arxiv.org/abs/2407.03892
Author:
Ji, Wenjie, Chen, Xie
The bulk-boundary correspondence of topological phases suggests strong connections between the topological features in a d+1-dimensional bulk and the potentially gapless theory on the (d-1)+1-dimensional boundary. In 2+1D topological phases, a direct
External link:
http://arxiv.org/abs/2407.02488
Neural codec language model (LM) has demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, the codec LM often suffers from limitations in inference speed and stability, due to its auto-regressive nature and implicit ali
External link:
http://arxiv.org/abs/2406.15752
Author:
Jiang, Anbai, Han, Bing, Lv, Zhiqiang, Deng, Yufeng, Zhang, Wei-Qiang, Chen, Xie, Qian, Yanmin, Liu, Jia, Fan, Pingyi
Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machi
External link:
http://arxiv.org/abs/2406.11364
Author:
Yang, Yifan, Song, Zheshu, Zhuo, Jianheng, Cui, Mingyu, Li, Jinpeng, Yang, Bo, Du, Yexing, Ma, Ziyang, Liu, Xunying, Wang, Ziyuan, Li, Ke, Fan, Shuai, Yu, Kai, Zhang, Wei-Qiang, Chen, Guoguo, Chen, Xie
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpe
External link:
http://arxiv.org/abs/2406.11546
Author:
Chang, Xuankai, Shi, Jiatong, Tian, Jinchuan, Wu, Yuning, Tang, Yuxun, Wu, Yihan, Watanabe, Shinji, Adi, Yossi, Chen, Xie, Jin, Qin
Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compr
External link:
http://arxiv.org/abs/2406.07725
Author:
Ma, Ziyang, Chen, Mingjie, Zhang, Hezhao, Zheng, Zhisheng, Chen, Wenxi, Li, Xiquan, Ye, Jiaxin, Chen, Xie, Hain, Thomas
Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are
External link:
http://arxiv.org/abs/2406.07162
As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh per
External link:
http://arxiv.org/abs/2406.05839
Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite that, two main challenges persist in multilingua
External link:
http://arxiv.org/abs/2406.06619
Author:
Chen, Mingjie, Zhang, Hezhao, Li, Yuanchao, Luo, Jiachen, Wu, Wen, Ma, Ziyang, Bell, Peter, Lai, Catherine, Reiss, Joshua, Wang, Lin, Woodland, Philip C., Chen, Xie, Phan, Huy, Hain, Thomas
Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to s
External link:
http://arxiv.org/abs/2405.20064