Showing 1 - 10 of 7,121 for search: '"LI, JUNJIE"'
Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2412.08247
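A minimal sketch of the masking-style audio-visual fusion that AV-TSE systems of this kind typically use, assuming spectrogram inputs and per-frame visual embeddings; the module layout and sizes are illustrative assumptions, not the paper's architecture:

# Minimal sketch of mask-based audio-visual target speaker extraction.
# All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ToyAVTSE(nn.Module):
    def __init__(self, n_freq=257, vis_dim=512, hid=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hid)    # mixture spectrogram frames
        self.visual_proj = nn.Linear(vis_dim, hid)  # per-frame lip/face embeddings
        self.fusion = nn.LSTM(2 * hid, hid, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hid, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, visual):
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the mixture
        # visual:   (B, T, vis_dim) time-synchronized visual cue embeddings
        a = self.audio_proj(mix_spec)
        v = self.visual_proj(visual)
        h, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(h)        # (B, T, n_freq) values in [0, 1]
        return mask * mix_spec          # estimated target magnitude

est = ToyAVTSE()(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
print(est.shape)  # torch.Size([2, 100, 257])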
Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2410.20109
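A minimal sketch of what a visual encoder contributes to a multimodal LLM: mapping an image to a sequence of vectors the language model can attend to. The ViT-like patch embedding and all dimensions are illustrative assumptions:

# Minimal sketch of a visual encoder feeding a multimodal LLM.
import torch
import torch.nn as nn

class ToyVisualEncoder(nn.Module):
    def __init__(self, llm_dim=768):
        super().__init__()
        # Split the image into 16x16 patches and embed each patch as one vector.
        self.patch_embed = nn.Conv2d(3, llm_dim, kernel_size=16, stride=16)

    def forward(self, images):
        # images: (B, 3, 224, 224) -> (B, 196, llm_dim) patch-vector sequence
        x = self.patch_embed(images)         # (B, llm_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # token sequence for the LLM

tokens = ToyVisualEncoder()(torch.rand(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])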
Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding pre-trained with a large number of speakers may… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2410.16059
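A minimal sketch of the common TSE setup this entry refers to: a separator conditioned on a pre-trained speaker embedding, here via FiLM-style modulation. All names and sizes are assumptions, not the paper's model:

# Minimal sketch of speaker-embedding-conditioned target speaker extraction.
import torch
import torch.nn as nn

class ToySpeakerConditionedTSE(nn.Module):
    def __init__(self, n_freq=257, emb_dim=192, hid=256):
        super().__init__()
        self.enc = nn.Linear(n_freq, hid)
        self.film = nn.Linear(emb_dim, 2 * hid)  # scale/shift from the cue
        self.mask_head = nn.Sequential(nn.Linear(hid, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, spk_emb):
        # mix_spec: (B, T, n_freq); spk_emb: (B, emb_dim) pre-trained embedding
        h = torch.relu(self.enc(mix_spec))
        scale, shift = self.film(spk_emb).unsqueeze(1).chunk(2, dim=-1)
        h = h * scale + shift                    # inject the reference cue
        return self.mask_head(h) * mix_spec

out = ToySpeakerConditionedTSE()(torch.rand(2, 100, 257), torch.rand(2, 192))
print(out.shape)  # torch.Size([2, 100, 257])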
Author:
Wang, Shuai; Zhang, Ke; Lin, Shaoxiong; Li, Junjie; Wang, Xuefei; Ge, Meng; Yu, Jianwei; Qian, Yanmin; Li, Haizhou
Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due to its potential…
External link:
http://arxiv.org/abs/2409.15799
Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2409.09589
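A minimal sketch of a standard mixture-style augmentation for TSE training: remix a clean target with an interfering utterance at a random SNR. This illustrates the general practice; the paper's specific augmentation scheme may differ:

# Minimal sketch of SNR-controlled remixing as data augmentation.
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the mixture has the requested target-to-interferer SNR."""
    t_pow = np.mean(target ** 2) + 1e-8
    i_pow = np.mean(interferer ** 2) + 1e-8
    gain = np.sqrt(t_pow / (i_pow * 10 ** (snr_db / 10)))
    return target + gain * interferer

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)      # 1 s of 16 kHz "speech" stand-in
interferer = rng.standard_normal(16000)
mixture = mix_at_snr(target, interferer, snr_db=rng.uniform(-5, 5))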
Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite efforts to build real-world simulators, leveraging generative models for virtual worlds, like financial…
External link:
http://arxiv.org/abs/2409.07486
Author:
Guo, Yiwei; Li, Zhihan; Li, Junjie; Du, Chenpeng; Wang, Hankun; Wang, Shuai; Chen, Xie; Yu, Kai
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2409.01995
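A minimal sketch of the discrete-token idea the abstract describes: continuous self-supervised features are quantized against a codebook to yield content tokens, while speaker timbre is supplied separately by a prompt at vocoding time. The codebook, feature sizes, and quantizer here are illustrative assumptions, not the paper's pipeline:

# Minimal sketch of quantizing SSL features into discrete content tokens.
import numpy as np

def quantize(features, codebook):
    # features: (T, D) SSL frame features; codebook: (K, D) k-means centroids
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)          # (T,) discrete content token IDs

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 768))  # stand-in for HuBERT-like features
codebook = rng.standard_normal((500, 768))
tokens = quantize(feats, codebook)       # content cue; the timbre reference
print(tokens[:10])                       # enters as a separate vocoder prompt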
AI-powered coding assistants such as GitHub Copilot and OpenAI ChatGPT have achieved notable success in automating code generation. However, these tools rely on pre-trained Large Language Models (LLMs) that are typically trained on human-written code…
External link:
http://arxiv.org/abs/2408.09078
Remote sensing shadow removal, which aims to recover contaminated surface information, is tricky since shadows typically display overwhelmingly low illumination intensities. In contrast, the infrared image is robust toward significant light changes…
External link:
http://arxiv.org/abs/2406.17469
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics… (a toy sketch follows the link below)
External link:
http://arxiv.org/abs/2406.02167
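A minimal sketch of multi-scale feature fusion: parallel temporal convolutions with different receptive fields, whose pooled outputs are concatenated into one utterance embedding, so short recordings still yield a stable representation. The branch layout and sizes are assumptions; the paper's fusion may differ:

# Minimal sketch of multi-scale feature fusion for speaker embeddings.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, feat_dim=256, kernels=(1, 3, 5)):
        super().__init__()
        # Parallel branches with different temporal receptive fields.
        self.branches = nn.ModuleList(
            nn.Conv1d(feat_dim, feat_dim, k, padding=k // 2) for k in kernels
        )
        self.out = nn.Linear(feat_dim * len(kernels), feat_dim)

    def forward(self, frames):
        # frames: (B, T, feat_dim) frame-level features from the encoder
        x = frames.transpose(1, 2)                  # (B, D, T)
        pooled = [torch.relu(b(x)).mean(dim=-1) for b in self.branches]
        return self.out(torch.cat(pooled, dim=-1))  # fused utterance embedding

emb = MultiScaleFusion()(torch.rand(2, 100, 256))
print(emb.shape)  # torch.Size([2, 256])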