Výsledky vyhledávání

Report

Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

Autor: Wang, Zhiyong, Fu, Ruibo, Wen, Zhengqi, Tao, Jianhua, Wang, Xiaopeng, Xie, Yuankun, Qi, Xin, Shi, Shuchen, Lu, Yi, Liu, Yukun, Li, Chenxing, Liu, Xuefei, Li, Guanjun

Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of pretrained model further enh

Externí odkaz: http://arxiv.org/abs/2409.11909

Zobrazit plný text záznamu

Report

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Autor: Qi, Xin, Fu, Ruibo, Wen, Zhengqi, Wang, Tao, Qiang, Chunyu, Tao, Jianhua, Li, Chenxing, Lu, Yi, Shi, Shuchen, Wang, Zhiyong, Wang, Xiaopeng, Xie, Yuankun, Liu, Yukun, Liu, Xuefei, Li, Guanjun

In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel sp

Externí odkaz: http://arxiv.org/abs/2409.11835

Zobrazit plný text záznamu

Report

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Autor: Xu, Manjie, Li, Chenxing, Tu, Xinyi, Ren, Yong, Fu, Ruibo, Liang, Wei, Yu, Dong

We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable success in vario

Externí odkaz: http://arxiv.org/abs/2409.09401

Zobrazit plný text záznamu

Report

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

Autor: Xiong, Chenxu, Fu, Ruibo, Shi, Shuchen, Wen, Zhengqi, Tao, Jianhua, Wang, Tao, Li, Chenxing, Qiang, Chunyu, Xie, Yuankun, Qi, Xin, Li, Guanjun, Yang, Zizheng

Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is propose

Externí odkaz: http://arxiv.org/abs/2409.09381

Zobrazit plný text záznamu

Report

Exploring the Role of Audio in Multimodal Misinformation Detection

Autor: Liu, Moyang, Liu, Yukun, Fu, Ruibo, Wen, Zhengqi, Tao, Jianhua, Liu, Xuefei, Li, Guanjun

With the rapid development of deepfake technology, especially the deep audio fake technology, misinformation detection on the social media scene meets a great challenge. Social media data often contains multimodal information which includes audio, vi

Externí odkaz: http://arxiv.org/abs/2408.12558

Zobrazit plný text záznamu

Report

Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

Autor: Xie, Yuankun, Xiong, Chenxu, Wang, Xiaopeng, Wang, Zhiyong, Lu, Yi, Qi, Xin, Fu, Ruibo, Liu, Yukun, Wen, Zhengqi, Tao, Jianhua, Li, Guanjun, Ye, Long

Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and dive

Externí odkaz: http://arxiv.org/abs/2408.10853

Zobrazit plný text záznamu

Report

EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech

Autor: Qi, Xin, Fu, Ruibo, Wen, Zhengqi, Tao, Jianhua, Shi, Shuchen, Lu, Yi, Wang, Zhiyong, Wang, Xiaopeng, Xie, Yuankun, Liu, Yukun, Li, Guanjun, Liu, Xuefei, Li, Yongwei

In the current era of Artificial Intelligence Generated Content (AIGC), a Low-Rank Adaptation (LoRA) method has emerged. It uses a plugin-based approach to learn new knowledge with lower parameter quantities and computational costs, and it can be plu

Externí odkaz: http://arxiv.org/abs/2408.10852

Zobrazit plný text záznamu

Report

A Noval Feature via Color Quantisation for Fake Audio Detection

Autor: Wang, Zhiyong, Wang, Xiaopeng, Xie, Yuankun, Fu, Ruibo, Wen, Zhengqi, Tao, Jianhua, Liu, Yukun, Li, Guanjun, Qi, Xin, Lu, Yi, Liu, Xuefei, Li, Yongwei

In the field of deepfake detection, previous studies focus on using reconstruction or mask and prediction methods to train pre-trained models, which are then transferred to fake audio detection training where the encoder is used to extract features,

Externí odkaz: http://arxiv.org/abs/2408.10849

Zobrazit plný text záznamu

Report

Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

Autor: Xie, Yuankun, Wang, Xiaopeng, Wang, Zhiyong, Fu, Ruibo, Wen, Zhengqi, Cheng, Haonan, Ye, Long

ASVspoof5, the fifth edition of the ASVspoof series, is one of the largest global audio security challenges. It aims to advance the development of countermeasure (CM) to discriminate bonafide and spoofed speech utterances. In this paper, we focus on

Externí odkaz: http://arxiv.org/abs/2408.06922

Zobrazit plný text záznamu

Report

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Autor: Qiang, Chunyu, Geng, Wang, Zhao, Yi, Fu, Ruibo, Wang, Tao, Gong, Cheng, Wang, Tianrui, Liu, Qiuyu, Yi, Jiangyan, Wen, Zhengqi, Zhang, Chen, Che, Hao, Wang, Longbiao, Dang, Jianwu, Tao, Jianhua

Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) se

Externí odkaz: http://arxiv.org/abs/2408.05758

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání