Showing 1 - 10 of 60 420 results for search: '"multimèdia"'
Author:
Zhang, Weixiang, Xie, Shuzhao, Ren, Chengwei, Xie, Siyi, Tang, Chen, Ge, Shijia, Wang, Mingzi, Wang, Zhi
We propose EVOlutionary Selector (EVOS), an efficient training paradigm for accelerating Implicit Neural Representation (INR). Unlike conventional INR training that feeds all samples through the neural network in each iteration, our approach restricts…
External link:
http://arxiv.org/abs/2412.10153
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-qua…
External link:
http://arxiv.org/abs/2412.09608
Author:
Zhong, Zhisheng, Wang, Chengyao, Liu, Yuqi, Yang, Senqiao, Tang, Longxiang, Zhang, Yuechen, Li, Jingyao, Qu, Tianyuan, Li, Yanwei, Chen, Yukang, Yu, Shaozuo, Wu, Sitong, Lo, Eric, Liu, Shu, Jia, Jiaya
As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its i…
External link:
http://arxiv.org/abs/2412.09501
The proliferation of AI-generated content and sophisticated video editing tools has made it both important and challenging to moderate digital platforms. Video watermarking addresses these challenges by embedding imperceptible signals into videos, al…
External link:
http://arxiv.org/abs/2412.09492
Author:
Wang, Baisen, Zhuo, Le, Wang, Zhaokai, Bao, Chenxi, Wu, Chengjing, Nie, Xuecheng, Dai, Jiao, Han, Jizhong, Liao, Yue, Liu, Si
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their applicati…
External link:
http://arxiv.org/abs/2412.09428
Author:
Parascandolo, Fiorenzo, Moratelli, Nicholas, Sangineto, Enver, Baraldi, Lorenzo, Cucchiara, Rita
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositi…
External link:
http://arxiv.org/abs/2412.09353
Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) tas…
External link:
http://arxiv.org/abs/2412.09329
Author:
Fernandez, Antonio, Awinat, Suzan
Published in:
Procedia Computer Science, Volume 251, 2024, Pages 41-48, ISSN 1877-0509
Despite the abundance of current research on sentiment analysis from videos and audio, finding the model that achieves the highest accuracy is still considered a challenge for researchers in this field. The main objective of thi…
External link:
http://arxiv.org/abs/2412.09317
Author:
Chen, Zihao, Zhang, Haomin, Di, Xinhan, Wang, Haoyu, Shan, Sizhe, Zheng, Junjie, Liang, Yunming, Fan, Yihan, Zhu, Xinfa, Tian, Wenjie, Wang, Yihua, Ding, Chaofan, Xie, Lei
Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real…
External link:
http://arxiv.org/abs/2412.09168
Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess…
External link:
http://arxiv.org/abs/2412.09126