Search results

Report

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Authors: Weng, Yuzhe; Wang, Haotian; Gao, Tian; Li, Kewei; Niu, Shutong; Du, Jun

In multimodal sentiment analysis, collecting text data is often more challenging than video or audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, our study has developed a robu

External link: http://arxiv.org/abs/2410.15029

Report

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Authors: Cheng, Hanbo; Lin, Limin; Liu, Chenyu; Xia, Pengcheng; Hu, Pengfei; Ma, Jiefeng; Du, Jun; Pan, Jia

Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autore

External link: http://arxiv.org/abs/2410.13726

Report

Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

Authors: He, Mao-Kui; Du, Jun; Niu, Shu-Tong; Liu, Qing-Feng; Lee, Chin-Hui

In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary c

External link: http://arxiv.org/abs/2410.22350

Report

The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge

Authors: Jiang, Ya; Lan, Hongbo; Du, Jun; Wang, Qing; Niu, Shutong

In the two-person conversation scenario with one wearing smart glasses, transcribing and displaying the speaker's content in real-time is an intriguing application, providing a priori information for subsequent tasks such as translation and comprehen

External link: http://arxiv.org/abs/2410.05986

Report

See then Tell: Enhancing Key Information Extraction with Vision Grounding

Authors: Liu, Shuhang; Zhang, Zhenrong; Hu, Pengfei; Ma, Jiefeng; Du, Jun; Wang, Qing; Zhang, Jianshu; Liu, Chenyu

In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which oft

External link: http://arxiv.org/abs/2409.19573

Report

Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

Authors: Wang, Ruoyu; Niu, Shutong; Yang, Gaobin; Du, Jun; Qian, Shuangqing; Gao, Tian; Pan, Jia

Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speak

External link: http://arxiv.org/abs/2409.16803

Report

UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition

Authors: Zhang, Zhenrong; Liu, Shuhang; Hu, Pengfei; Ma, Jiefeng; Du, Jun; Zhang, Jianshu; Hu, Yu

In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively com

External link: http://arxiv.org/abs/2409.13148

Report

DocMamba: Efficient Document Pre-training with State Space Model

Authors: Hu, Pengfei; Zhang, Zhenrong; Ma, Jiefeng; Liu, Shuhang; Du, Jun; Zhang, Jianshu

In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mech

External link: http://arxiv.org/abs/2409.11887

Report

Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge

Authors: Xue, Hongfei; Gong, Rong; Shao, Mingchen; Xu, Xin; Wang, Lezhi; Xie, Lei; Bu, Hui; Zhou, Jiaming; Qin, Yong; Du, Jun; Li, Ming; Zhang, Binbin; Jia, Bin

The StutteringSpeech Challenge focuses on advancing speech technologies for people who stutter, specifically targeting Stuttering Event Detection (SED) and Automatic Speech Recognition (ASR) in Mandarin. The challenge comprises three tracks: (1) SED,

External link: http://arxiv.org/abs/2409.05430

Report

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap r

External link: http://arxiv.org/abs/2409.02041
