Showing 1 - 10 of 170 results for search: '"Qian, Xinyuan"'
Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes…
External link:
http://arxiv.org/abs/2406.16058
The use of Transformer architectures has facilitated remarkable progress in speech enhancement. Training Transformers using substantially long speech utterances is often infeasible as self-attention suffers from quadratic complexity. It is a critical…
External link:
http://arxiv.org/abs/2406.11401
Author:
Zhang, Xiangyu, Zhang, Qiquan, Liu, Hexin, Xiao, Tianyi, Qian, Xinyuan, Ahmed, Beena, Ambikairajah, Eliathamby, Li, Haizhou, Epps, Julien
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer…
External link:
http://arxiv.org/abs/2405.12609
Audio-visual target speaker extraction (AV-TSE) aims to extract the specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this…
External link:
http://arxiv.org/abs/2404.18501
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in…
External link:
http://arxiv.org/abs/2404.00861
Author:
Zhao, Jinzheng, Xu, Yong, Qian, Xinyuan, Berghi, Davide, Wu, Peipei, Cui, Meng, Sun, Jianyuan, Jackson, Philip J. B., Wang, Wenwu
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual…
External link:
http://arxiv.org/abs/2310.14778
The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing directional output for each speaker in multi-speaker scenarios, without association with the identity of speakers. In this…
External link:
http://arxiv.org/abs/2310.10497
The use of audio and visual modality for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike…
External link:
http://arxiv.org/abs/2309.16308
Author:
Lai, Zhi-Hao, Zhang, Tian-Hao, Liu, Qi, Qian, Xinyuan, Wei, Li-Fang, Chen, Song-Lu, Chen, Feng, Yin, Xu-Cheng
The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention…
External link:
http://arxiv.org/abs/2305.16342
Author:
Zhang, Tian-Hao, Qin, Hai-Bo, Lai, Zhi-Hao, Chen, Song-Lu, Liu, Qi, Chen, Feng, Qian, Xinyuan, Yin, Xu-Cheng
Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in the decoder, which is crucial for generating more accurate…
External link:
http://arxiv.org/abs/2305.14049