Showing 1 - 10 of 16 for search: '"Zuo, Jialong"'
Authors:
Ji, Shengpeng, Chen, Yifu, Fang, Minghui, Zuo, Jialong, Lu, Jingyu, Wang, Hanting, Jiang, Ziyue, Zhou, Long, Liu, Shujie, Cheng, Xize, Yang, Xiaoda, Wang, Zehan, Yang, Qian, Li, Jian, Jiang, Yidi, He, Jingzhen, Chu, Yunfei, Xu, Jin, Zhao, Zhou
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), la
External link:
http://arxiv.org/abs/2411.13577
Authors:
Cheng, Xize, Zheng, Siqi, Wang, Zehan, Fang, Minghui, Zhang, Ziang, Huang, Rongjie, Ma, Ziyang, Ji, Shengpeng, Zuo, Jialong, Jin, Tao, Zhao, Zhou
The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse inter
External link:
http://arxiv.org/abs/2410.21269
Authors:
Zuo, Jialong, Nie, Ying, Zhou, Hanyu, Zhang, Huaxin, Wang, Haoyu, Guo, Tianyu, Sang, Nong, Gao, Changxin
Recent researches have proven that pre-training on large-scale person images extracted from internet videos is an effective way in learning better representations for person re-identification. However, these researches are mostly confined to pre-trai
External link:
http://arxiv.org/abs/2409.18569
Authors:
Ji, Shengpeng, Jiang, Ziyue, Wang, Wen, Chen, Yifu, Fang, Minghui, Zuo, Jialong, Yang, Qian, Cheng, Xize, Wang, Zehan, Li, Ruiqi, Zhang, Ziang, Yang, Xiaoda, Huang, Rongjie, Jiang, Yidi, Chen, Qian, Zheng, Siqi, Zhao, Zhou
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional
External link:
http://arxiv.org/abs/2408.16532
Authors:
Yang, Qian, Zuo, Jialong, Su, Zhe, Jiang, Ziyue, Li, Mingze, Zhao, Zhou, Chen, Feiyang, Wang, Zhefeng, Huai, Baoxing
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed a
External link:
http://arxiv.org/abs/2407.14006
Authors:
Fang, Minghui, Ji, Shengpeng, Zuo, Jialong, Huang, Hai, Xia, Yan, Zhu, Jieming, Cheng, Xize, Yang, Xiaoda, Liu, Wenrui, Wang, Gang, Dong, Zhenhua, Zhao, Zhou
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity betw
External link:
http://arxiv.org/abs/2406.17507
Authors:
Zhang, Huaxin, Xu, Xiaohao, Wang, Xiang, Zuo, Jialong, Han, Chuchu, Huang, Xiaonan, Gao, Changxin, Wang, Yuehuan, Sang, Nong
Towards open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events and lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that levera
External link:
http://arxiv.org/abs/2406.12235
Authors:
Ji, Shengpeng, Zuo, Jialong, Wang, Wen, Fang, Minghui, Zheng, Siqi, Chen, Qian, Jiang, Ziyue, Huang, Hai, Wang, Zehan, Cheng, Xize, Zhao, Zhou
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual st
External link:
http://arxiv.org/abs/2406.01205
Authors:
Hong, Jiahao, Zuo, Jialong, Han, Chuchu, Zheng, Ruochen, Tian, Ming, Gao, Changxin, Sang, Nong
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizo
External link:
http://arxiv.org/abs/2403.00261
Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Takin
External link:
http://arxiv.org/abs/2402.09378