Showing 1 - 10 of 170 results for search: '"Qian, Xinyuan"'
Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes…
External link:
http://arxiv.org/abs/2406.16058
The use of Transformer architectures has facilitated remarkable progress in speech enhancement. Training Transformers using substantially long speech utterances is often infeasible as self-attention suffers from quadratic complexity. It is a critical…
External link:
http://arxiv.org/abs/2406.11401
Author:
Zhang, Xiangyu, Zhang, Qiquan, Liu, Hexin, Xiao, Tianyi, Qian, Xinyuan, Ahmed, Beena, Ambikairajah, Eliathamby, Li, Haizhou, Epps, Julien
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer…
External link:
http://arxiv.org/abs/2405.12609
Audio-visual target speaker extraction (AV-TSE) aims to extract the specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this…
External link:
http://arxiv.org/abs/2404.18501
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in…
External link:
http://arxiv.org/abs/2404.00861
Author:
Zhao, Jinzheng, Xu, Yong, Qian, Xinyuan, Berghi, Davide, Wu, Peipei, Cui, Meng, Sun, Jianyuan, Jackson, Philip J. B., Wang, Wenwu
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual…
External link:
http://arxiv.org/abs/2310.14778
The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing directional output for each speaker in multi-speaker scenarios, without association with the identity of speakers. In this…
External link:
http://arxiv.org/abs/2310.10497
The use of audio and visual modality for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike…
External link:
http://arxiv.org/abs/2309.16308
Author:
Lai, Zhi-Hao, Zhang, Tian-Hao, Liu, Qi, Qian, Xinyuan, Wei, Li-Fang, Chen, Song-Lu, Chen, Feng, Yin, Xu-Cheng
The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention…
External link:
http://arxiv.org/abs/2305.16342
Author:
Zhang, Tian-Hao, Qin, Hai-Bo, Lai, Zhi-Hao, Chen, Song-Lu, Liu, Qi, Chen, Feng, Qian, Xinyuan, Yin, Xu-Cheng
Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in the decoder, which is crucial for generating more accurate…
External link:
http://arxiv.org/abs/2305.14049