Showing 1 - 10 of 958 results for the search: '"Hu, Wenbo"'
Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial…
External link: http://arxiv.org/abs/2410.08182
Authors: Zhao, Sijie; Hu, Wenbo; Cun, Xiaodong; Zhang, Yong; Li, Xiaoyu; Kong, Zhe; Gao, Xiangjun; Niu, Muyao; Shan, Ying
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional…
External link: http://arxiv.org/abs/2409.07447
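To make the task concrete, below is a minimal sketch of the classic depth-based baseline for 2D-to-stereo conversion. It is a generic illustration only, not StereoCrafter's actual pipeline; the disparity scaling and hole handling are assumptions.

    import numpy as np

    # Generic depth-based stereo baseline (not the paper's method): shift each pixel
    # horizontally by a disparity proportional to inverse depth to form a right-eye view.
    def warp_right_view(left, depth, max_disparity=24):
        """left: (H, W, 3) uint8 frame; depth: (H, W) floats, larger = farther."""
        h, w, _ = left.shape
        inv = 1.0 / np.maximum(depth, 1e-6)
        inv = (inv - inv.min()) / (inv.max() - inv.min() + 1e-6)  # normalize to [0, 1]
        disparity = (inv * max_disparity).astype(np.int32)        # nearer pixels shift more
        right = np.zeros_like(left)
        cols = np.arange(w)
        for y in range(h):
            x_dst = np.clip(cols - disparity[y], 0, w - 1)        # forward warp along the row
            right[y, x_dst] = left[y, cols]                       # last write wins; no z-buffer
        return right  # disocclusion holes stay black; real systems inpaint them

The hand-crafted depth and hole-filling steps are exactly where such baselines struggle, which is the gap that the foundation-model priors mentioned in the abstract are aimed at.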
Authors: Hu, Wenbo; Gao, Xiangjun; Li, Xiaoyu; Zhao, Sijie; Cun, Xiaodong; Zhang, Yong; Quan, Long; Shan, Ying
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present…
External link: http://arxiv.org/abs/2409.02095
Authors: Yu, Wangbo; Xing, Jinbo; Yuan, Li; Hu, Wenbo; Li, Xiaoyu; Huang, Zhipeng; Gao, Xiangjun; Wong, Tien-Tsin; Shan, Ying; Tian, Yonghong
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic…
External link: http://arxiv.org/abs/2409.02048
Authors: Zhao, Sijie; Zhang, Yong; Cun, Xiaodong; Yang, Shaoshu; Niu, Muyao; Li, Xiaoyu; Hu, Wenbo; Shan, Ying
Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAEs), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete…
External link: http://arxiv.org/abs/2405.20279
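For a rough sense of what such spatio-temporal compression buys, the sketch below computes the latent size for assumed compression factors (4x temporal, 8x8 spatial, 4 latent channels). These are common defaults, not CV-VAE's actual configuration.

    # Illustrative arithmetic with assumed factors, not the paper's exact design.
    def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=4):
        """Shape of the latent grid a video VAE might produce for an RGB clip."""
        return (latent_channels, frames // t_factor, height // s_factor, width // s_factor)

    def compression_ratio(frames, height, width, **kw):
        """How many times fewer values the latent holds than the raw RGB video."""
        c, t, h, w = latent_shape(frames, height, width, **kw)
        return (3 * frames * height * width) / (c * t * h * w)

    print(latent_shape(16, 512, 512))        # (4, 4, 64, 64)
    print(compression_ratio(16, 512, 512))   # 192.0: ~192x fewer values to model

A generative model then learns the distribution over this much smaller latent grid rather than over raw pixels.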
Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational…
External link: http://arxiv.org/abs/2405.19315
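The fixed budget of 576 visual tokens mentioned above is just the patch count of a typical CLIP-style ViT encoder. The snippet below shows the arithmetic; the 336-pixel input and 14-pixel patches are assumed common values, not details taken from the paper.

    # Illustrative: where a fixed visual-token budget like 576 comes from.
    def visual_token_count(image_size=336, patch_size=14):
        """Number of patch tokens a ViT-style encoder emits for a square image."""
        per_side = image_size // patch_size   # 336 / 14 = 24
        return per_side * per_side            # 24 * 24 = 576

    def relative_visual_cost(num_tokens, baseline=576):
        """Fraction of the baseline visual-token compute if the budget is reduced."""
        return num_tokens / baseline

    print(visual_token_count())        # 576
    print(relative_visual_cost(144))   # 0.25: a quarter of the visual-token budget

Adapting to varying computational constraints then amounts to choosing how many of these tokens to keep at inference time.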
Authors: Gao, Xiangjun; Li, Xiaoyu; Zhuang, Yiyu; Zhang, Qi; Hu, Wenbo; Zhang, Chaopeng; Yao, Yao; Shan, Ying; Quan, Long
Neural 3D representations such as Neural Radiance Fields (NeRF) excel at producing photo-realistic rendering results but lack the flexibility for manipulation and editing, which is crucial for content creation. Previous works have attempted to address…
External link: http://arxiv.org/abs/2405.17811
Authors: Liu, Junchen; Hu, Wenbo; Yang, Zhuo; Chen, Jianteng; Wang, Guoliang; Chen, Xiaoxue; Cai, Yantong; Gao, Huan-ang; Zhao, Hao
Despite significant advancements in Neural Radiance Fields (NeRFs), the renderings may still suffer from aliasing and blurring artifacts, since it remains a fundamental challenge to effectively and efficiently characterize anisotropic areas induced by…
External link: http://arxiv.org/abs/2405.02386
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and…
External link: http://arxiv.org/abs/2404.13874
Inverse rendering aims at recovering both the geometry and materials of objects. Compared with neural radiance fields (NeRFs), it provides a reconstruction that is more compatible with conventional rendering engines. On the other hand, existing NeRF-based inverse…
External link: http://arxiv.org/abs/2403.16224
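For readers unfamiliar with the term, the toy forward model below shows the quantities inverse rendering tries to recover. It is a generic Lambertian illustration, not the paper's formulation.

    import numpy as np

    # Toy forward shading model; inverse rendering fits albedo (material) and normal
    # (geometry), and often lighting, so renderings like this match the input photos.
    def lambertian_shade(albedo, normal, light_dir, light_intensity=1.0):
        """albedo: (3,) RGB reflectance in [0, 1]; normal, light_dir: (3,) vectors."""
        n = np.asarray(normal, dtype=float) / np.linalg.norm(normal)
        l = np.asarray(light_dir, dtype=float) / np.linalg.norm(light_dir)
        cosine = max(float(n @ l), 0.0)               # no light from behind the surface
        return np.asarray(albedo) * light_intensity * cosine

    print(lambertian_shade([0.8, 0.2, 0.2], [0, 0, 1], [0, 0.6, 0.8]))  # dimmed red

Because the recovered albedo, normals, and lighting are explicit assets, they can be loaded directly into conventional rendering engines, which is the compatibility advantage the abstract highlights.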