Showing 1 - 10 of 102
for search: '"Carreira, João"'
Author:
van Steenkiste, Sjoerd, Zoran, Daniel, Yang, Yi, Rubanova, Yulia, Kabra, Rishabh, Doersch, Carl, Gokay, Dilara, Heyward, Joseph, Pot, Etienne, Greff, Klaus, Hudson, Drew A., Keck, Thomas Albert, Carreira, Joao, Dosovitskiy, Alexey, Sajjadi, Mehdi S. M., Kipf, Thomas
Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific…
External link:
http://arxiv.org/abs/2411.05927
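A minimal sketch (my own illustration, not code from the paper) of the "on-the-grid" token layout this abstract critiques: the image is cut into fixed patches, so each token is bound to one spatial location for the whole forward pass.

# Illustrative sketch: fixed patch tokenization, the "on-the-grid" layout.
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into (H//patch * W//patch) flat patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)
    )
    return tokens  # token i always corresponds to the same image patch

grid = patchify(np.zeros((224, 224, 3), dtype=np.float32))
print(grid.shape)  # (196, 768): a 14x14 grid, one token per fixed location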
Author:
Koppula, Skanda, Rocco, Ignacio, Yang, Yi, Heyward, Joe, Carreira, João, Zisserman, Andrew, Brostow, Gabriel, Doersch, Carl
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). While point tracking in two dimensions (TAP) has many benchmarks measuring performance on real-world videos, such as TAPVid-DAVIS, three…
External link:
http://arxiv.org/abs/2407.05921
Author:
Doersch, Carl, Luc, Pauline, Yang, Yi, Gokay, Dilara, Koppula, Skanda, Gupta, Ankush, Heyward, Joseph, Rocco, Ignacio, Goroshin, Ross, Carreira, João, Zisserman, Andrew
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any…
External link:
http://arxiv.org/abs/2402.00847
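As a hedged illustration of the TAP task contract mentioned above (function names and the pixel threshold are my assumptions, not the paper's): a tracker maps a query point to per-frame positions plus visibility flags, which can then be scored against ground truth.

# Hypothetical sketch of evaluating a TAP-style track against ground truth.
import numpy as np

def evaluate_track(pred_xy, pred_visible, gt_xy, gt_visible, thresh=8.0):
    """Fraction of mutually visible frames where the predicted point lands
    within `thresh` pixels of ground truth (an illustrative metric only)."""
    both = pred_visible & gt_visible
    if not both.any():
        return 0.0
    err = np.linalg.norm(pred_xy[both] - gt_xy[both], axis=-1)
    return float((err < thresh).mean())

T = 50                                             # frames in the clip
gt = np.cumsum(np.random.randn(T, 2), axis=0)      # a wandering ground-truth point
pred = gt + np.random.randn(T, 2)                  # a noisy tracker's output
vis = np.ones(T, dtype=bool)                       # visible in every frame
print(evaluate_track(pred, vis, gt, vis))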
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark…
External link:
http://arxiv.org/abs/2312.13090
Author:
Carreira, João, King, Michael, Pătrăucean, Viorica, Gokay, Dilara, Ionescu, Cătălin, Yang, Yi, Zoran, Daniel, Heyward, Joseph, Doersch, Carl, Aytar, Yusuf, Damen, Dima, Zisserman, Andrew
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames…
External link:
http://arxiv.org/abs/2312.00598
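A minimal sketch of the setting described above, under my own assumptions (a stand-in linear model and a synthetic correlated stream): one gradient update per incoming frame, with no shuffling, mini-batching, or augmentation.

# Illustrative single-stream online learning loop (not the paper's code).
import torch

model = torch.nn.Linear(512, 10)            # stand-in for a video model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def frame_stream(num_frames=1000):
    """Placeholder stream: consecutive frames are highly correlated."""
    x = torch.randn(512)
    for _ in range(num_frames):
        x = 0.95 * x + 0.05 * torch.randn(512)   # slow drift, like real video
        yield x, torch.randint(10, (1,)).item()

for features, label in frame_stream():
    loss = torch.nn.functional.cross_entropy(
        model(features).unsqueeze(0), torch.tensor([label]))
    opt.zero_grad()
    loss.backward()
    opt.step()                                # one update per incoming frame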
Author:
Venkataramanan, Shashanka, Rizve, Mamshad Nayeem, Carreira, João, Asano, Yuki M., Avrithis, Yannis
Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How more economical can we be? In this work, we attempt to answer this question…
External link:
http://arxiv.org/abs/2310.08584
Author:
Doersch, Carl, Yang, Yi, Vecerik, Mel, Gokay, Dilara, Gupta, Ankush, Aytar, Yusuf, Carreira, Joao, Zisserman, Andrew
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate…
External link:
http://arxiv.org/abs/2306.08637
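A hedged sketch of a generic stage-1 matching step like the one outlined above (shapes and names are illustrative, not the paper's implementation): the query point's feature is compared against a dense feature map of every frame, and the best-scoring cell becomes that frame's candidate.

# Illustrative per-frame matching via dot-product similarity.
import numpy as np

def match_per_frame(query_feat, frame_feats):
    """query_feat: (D,); frame_feats: (T, H, W, D) -> (T, 2) candidate (y, x)."""
    t, h, w, d = frame_feats.shape
    scores = frame_feats.reshape(t, h * w, d) @ query_feat   # similarity per cell
    flat = scores.argmax(axis=1)                             # best cell per frame
    return np.stack([flat // w, flat % w], axis=-1)

feats = np.random.randn(8, 32, 32, 64).astype(np.float32)
cands = match_per_frame(np.random.randn(64).astype(np.float32), feats)
print(cands.shape)  # (8, 2): one independent candidate per frame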
Author:
Pătrăucean, Viorica, Smaira, Lucas, Gupta, Ankush, Continente, Adrià Recasens, Markeeva, Larisa, Banarse, Dylan, Koppula, Skanda, Heyward, Joseph, Malinowski, Mateusz, Yang, Yi, Doersch, Carl, Matejovicova, Tatiana, Sulsky, Yury, Miech, Antoine, Frechette, Alex, Klimczak, Hanna, Koster, Raphael, Zhang, Junlin, Winkler, Stephanie, Aytar, Yusuf, Osindero, Simon, Damen, Dima, Zisserman, Andrew, Carreira, João
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks…
External link:
http://arxiv.org/abs/2305.13786
Author:
Recasens, Adrià, Lin, Jason, Carreira, João, Jaegle, Drew, Wang, Luyu, Alayrac, Jean-baptiste, Luc, Pauline, Miech, Antoine, Smaira, Lucas, Hemsley, Ross, Zisserman, Andrew
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however…
External link:
http://arxiv.org/abs/2301.09595
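A minimal sketch of the fusion pattern this abstract describes, with assumed token counts and dimensions: tokens from each modality are simply concatenated along the sequence axis and passed through one shared transformer backbone, with no dedicated fusion modules.

# Illustrative concatenation-based multimodal fusion (dimensions assumed).
import torch

backbone = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)

video_tokens = torch.randn(1, 196, 256)   # e.g. patch tokens from frames
audio_tokens = torch.randn(1, 64, 256)    # e.g. spectrogram patch tokens

fused = backbone(torch.cat([video_tokens, audio_tokens], dim=1))
print(fused.shape)  # (1, 260, 256): one joint sequence through one backbone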
Author:
Doersch, Carl, Gupta, Ankush, Markeeva, Larisa, Recasens, Adrià, Smaira, Lucas, Aytar, Yusuf, Carreira, João, Zisserman, Andrew, Yang, Yi
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the…
External link:
http://arxiv.org/abs/2211.03726