Showing 1 - 10 of 7,113 for search: '"DU, Jun"'
In multimodal sentiment analysis, collecting text data is often more challenging than video or audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, our study has developed a robu…
External link:
http://arxiv.org/abs/2410.15029
Authors:
Cheng, Hanbo, Lin, Limin, Liu, Chenyu, Xia, Pengcheng, Hu, Pengfei, Ma, Jiefeng, Du, Jun, Pan, Jia
Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autore…
External link:
http://arxiv.org/abs/2410.13726
In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary c…
External link:
http://arxiv.org/abs/2410.22350
In the two-person conversation scenario with one wearing smart glasses, transcribing and displaying the speaker's content in real-time is an intriguing application, providing a priori information for subsequent tasks such as translation and comprehen…
External link:
http://arxiv.org/abs/2410.05986
Authors:
Liu, Shuhang, Zhang, Zhenrong, Hu, Pengfei, Ma, Jiefeng, Du, Jun, Wang, Qing, Zhang, Jianshu, Liu, Chenyu
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which oft…
External link:
http://arxiv.org/abs/2409.19573
Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speak…
External link:
http://arxiv.org/abs/2409.16803
In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively com…
External link:
http://arxiv.org/abs/2409.13148
In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mech…
External link:
http://arxiv.org/abs/2409.11887
Authors:
Xue, Hongfei, Gong, Rong, Shao, Mingchen, Xu, Xin, Wang, Lezhi, Xie, Lei, Bu, Hui, Zhou, Jiaming, Qin, Yong, Du, Jun, Li, Ming, Zhang, Binbin, Jia, Bin
The StutteringSpeech Challenge focuses on advancing speech technologies for people who stutter, specifically targeting Stuttering Event Detection (SED) and Automatic Speech Recognition (ASR) in Mandarin. The challenge comprises three tracks: (1) SED,…
External link:
http://arxiv.org/abs/2409.05430
Authors:
Niu, Shutong, Wang, Ruoyu, Du, Jun, Yang, Gaobin, Tu, Yanhui, Wu, Siyuan, Qian, Shuangqing, Wu, Huaxin, Xu, Haitao, Zhang, Xueyang, Zhong, Guolong, Yu, Xindi, Chen, Jieru, Wang, Mengzhi, Cai, Di, Gao, Tian, Wan, Genshun, Ma, Feng, Pan, Jia, Gao, Jianqing
This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap r…
External link:
http://arxiv.org/abs/2409.02041