Showing 1 - 10 of 427 for search: '"Wu Enhua"'
Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based vide
External link:
http://arxiv.org/abs/2410.10441
Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant
External link:
http://arxiv.org/abs/2410.03755
Author:
Liu, Jiajun; Wang, Yibing; Ma, Hanghang; Wu, Xiaoping; Ma, Xiaoqi; Wei, Xiaoming; Jiao, Jianbin; Wu, Enhua; Hu, Jie
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient ac
External link:
http://arxiv.org/abs/2408.15542
Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this method is computationally demanding and necessitates large-scale video datasets. More critically, the heterogeneity
External link:
http://arxiv.org/abs/2407.21475
The Gaussian diffusion model, initially designed for image generation, has recently been adapted for 3D point cloud generation. However, these adaptations have not fully considered the intrinsic geometric characteristics of 3D shapes, thereby constra
External link:
http://arxiv.org/abs/2407.21428
Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues
External link:
http://arxiv.org/abs/2405.04042
Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe the "low FLOPs pitfall": existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we in
External link:
http://arxiv.org/abs/2306.14525
This paper proposes a Robust and Efficient Memory Network, referred to as REMN, for studying semi-supervised video object segmentation (VOS). Memory-based methods have recently achieved outstanding VOS performance by performing non-local pixel-wise m
External link:
http://arxiv.org/abs/2304.11840
Deep neural networks have been proven effective in a wide range of tasks. However, their high computational and memory costs make them impractical to deploy on resource-constrained devices. To address this issue, quantization schemes have been propos
External link:
http://arxiv.org/abs/2303.07080
Existing lifting networks for regressing 3D human poses from 2D single-view poses are typically constructed with linear layers based on graph-structured representation learning. In sharp contrast to them, this paper presents Grid Convolution (GridCon
External link:
http://arxiv.org/abs/2302.08760