Showing 1 - 10 of 1,630 for search: '"A. Tokmakov"'
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key …
External link:
http://arxiv.org/abs/2410.23287
3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets to produce geometric priors capable of …
External link:
http://arxiv.org/abs/2409.09896
Author:
Liang, Junbang, Liu, Ruoshi, Ozguroglu, Ege, Sudhakar, Sruthi, Dave, Achal, Tokmakov, Pavel, Song, Shuran, Vondrick, Carl
A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets …
External link:
http://arxiv.org/abs/2406.16862
Author:
Van Hoorick, Basile, Wu, Rundi, Ozguroglu, Ege, Sargent, Kyle, Liu, Ruoshi, Tokmakov, Pavel, Dave, Achal, Zheng, Changxi, Vondrick, Carl
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating …
External link:
http://arxiv.org/abs/2405.14868
Author:
Ozguroglu, Ege, Liu, Ruoshi, Surís, Dídac, Chen, Dian, Dave, Achal, Tokmakov, Pavel, Vondrick, Carl
We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring …
External link:
http://arxiv.org/abs/2401.14398
Author:
Kowal, Matthew, Dave, Achal, Ambrus, Rares, Gaidon, Adrien, Derpanis, Konstantinos G., Tokmakov, Pavel
This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically …
External link:
http://arxiv.org/abs/2401.10831
Author:
Chu, Wen-Hsuan, Harley, Adam W., Tokmakov, Pavel, Dave, Achal, Guibas, Leonidas, Fragkiadaki, Katerina
Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been the dominant paradigm for tracking objects of specific categories. Recently, large-scale pre-trained models have shown promising advances …
External link:
http://arxiv.org/abs/2310.06992
Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking through heavy occlusion and …
External link:
http://arxiv.org/abs/2305.03052
Published in:
CVPR 2023
Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or …
External link:
http://arxiv.org/abs/2303.15555
Author:
Liu, Ruoshi, Wu, Rundi, Van Hoorick, Basile, Tokmakov, Pavel, Zakharov, Sergey, Vondrick, Carl
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models …
External link:
http://arxiv.org/abs/2303.11328