Showing 1 - 10 of 35 results for search: '"Ma, Wufei"'
Author:
Ma, Wufei, Li, Kai, Jiang, Zhongshi, Meshry, Moustafa, Liu, Qihao, Wang, Huiyu, Häne, Christian, Yuille, Alan
Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could…
External link:
http://arxiv.org/abs/2407.13094
Author:
Ma, Wufei, Zeng, Guanning, Zhang, Guofeng, Liu, Qihao, Zhang, Letian, Kortylewski, Adam, Liu, Yaoyao, Yuille, Alan
A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is…
External link:
http://arxiv.org/abs/2406.09613
For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions within 3D scenes from video is crucial for effective reasoning. In this work, we introduce a video question answering dataset SuperCLEVR-Physics…
External link:
http://arxiv.org/abs/2406.00622
Deep learning-based video compression is a challenging task, and many previous state-of-the-art learning-based video codecs use optical flows to exploit the temporal correlation between successive frames and then compress the residual error. Although…
External link:
http://arxiv.org/abs/2403.19158
Despite rapid progress in visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks…
External link:
http://arxiv.org/abs/2310.17914
Author:
Xu, Jiacong, Zhang, Yi, Peng, Jiawei, Ma, Wufei, Jesslen, Artur, Ji, Pengliang, Hu, Qixin, Zhang, Jiehua, Liu, Qihao, Wang, Jiahao, Ji, Wei, Wang, Chen, Yuan, Xiaoding, Kaushik, Prakhar, Zhang, Guofeng, Liu, Jie, Xie, Yushan, Cui, Yawen, Yuille, Alan, Kortylewski, Adam
Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of…
External link:
http://arxiv.org/abs/2308.11737
Author:
Ma, Wufei, Liu, Qihao, Wang, Jiahao, Wang, Angtian, Yuan, Xiaoding, Zhang, Yi, Xiao, Zihao, Zhang, Guofeng, Lu, Beijia, Duan, Ruxiao, Qi, Yongrui, Kortylewski, Adam, Liu, Yaoyao, Yuille, Alan
Diffusion models have emerged as a powerful generative method, capable of producing stunning photo-realistic images from natural language descriptions. However, these models lack explicit control over the 3D structure in the generated images. Consequently…
External link:
http://arxiv.org/abs/2306.08103
Human vision demonstrates higher robustness than current AI algorithms under out-of-distribution scenarios. It has been conjectured that such robustness benefits from performing analysis-by-synthesis. Our paper formulates triple vision tasks in a consistent…
External link:
http://arxiv.org/abs/2306.00118
Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training…
External link:
http://arxiv.org/abs/2305.16124
Discriminative models for object classification typically learn image-based representations that do not capture the compositional and 3D nature of objects. In this work, we show that explicitly integrating 3D compositional object representations into…
External link:
http://arxiv.org/abs/2305.14668