Showing 1 - 10 of 3,518
for search: '"Koppula P"'
Author:
Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, Kolesnikov, Alexander, Wang, Xiao, Salz, Daniel, Neumann, Maxim, Alabdulmohsin, Ibrahim, Tschannen, Michael, Bugliarello, Emanuele, Unterthiner, Thomas, Keysers, Daniel, Koppula, Skanda, Liu, Fangyu, Grycner, Adam, Gritsenko, Alexey, Houlsby, Neil, Kumar, Manoj, Rong, Keran, Eisenschlos, Julian, Kabra, Rishabh, Bauer, Matthias, Bošnjak, Matko, Chen, Xi, Minderer, Matthias, Voigtlaender, Paul, Bica, Ioana, Balazevic, Ivana, Puigcerver, Joan, Papalampidi, Pinelopi, Henaff, Olivier, Xiong, Xi, Soricut, Radu, Harmsen, Jeremiah, Zhai, Xiaohua
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong…
External link:
http://arxiv.org/abs/2407.07726
Author:
Koppula, Skanda, Rocco, Ignacio, Yang, Yi, Heyward, Joe, Carreira, João, Zisserman, Andrew, Brostow, Gabriel, Doersch, Carl
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). While point tracking in two dimensions (TAP) has many benchmarks measuring performance on real-world videos, such as TAPVid-DAVIS, three…
External link:
http://arxiv.org/abs/2407.05921
Author:
Balažević, Ivana, Shi, Yuge, Papalampidi, Pinelopi, Chaabouni, Rahma, Koppula, Skanda, Hénaff, Olivier J.
Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity…
External link:
http://arxiv.org/abs/2402.05861
Author:
Doersch, Carl, Luc, Pauline, Yang, Yi, Gokay, Dilara, Koppula, Skanda, Gupta, Ankush, Heyward, Joseph, Rocco, Ignacio, Goroshin, Ross, Carreira, João, Zisserman, Andrew
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point…
External link:
http://arxiv.org/abs/2402.00847
Published in:
49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)
The Polynomial-Time Hierarchy ($\mathsf{PH}$) is a staple of classical complexity theory, with applications spanning randomized computation to circuit lower bounds to "quantum advantage" analyses for near-term quantum computers. Quantumly, however,…
External link:
http://arxiv.org/abs/2401.01633
Author:
Papalampidi, Pinelopi, Koppula, Skanda, Pathak, Shreya, Chiu, Justin, Heyward, Joe, Patraucean, Viorica, Shen, Jiajun, Miech, Antoine, Zisserman, Andrew, Nematzdeh, Aida
Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal…
External link:
http://arxiv.org/abs/2312.07395
Author:
Su, Hsuan, Hu, Ting-Yao, Koppula, Hema Swetha, Vemulapalli, Raviteja, Chang, Jen-Hao Rick, Yang, Karren, Mantena, Gautam Varma, Tuzel, Oncel
While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily…
External link:
http://arxiv.org/abs/2309.10707
Author:
Pătrăucean, Viorica, Smaira, Lucas, Gupta, Ankush, Continente, Adrià Recasens, Markeeva, Larisa, Banarse, Dylan, Koppula, Skanda, Heyward, Joseph, Malinowski, Mateusz, Yang, Yi, Doersch, Carl, Matejovicova, Tatiana, Sulsky, Yury, Miech, Antoine, Frechette, Alex, Klimczak, Hanna, Koster, Raphael, Zhang, Junlin, Winkler, Stephanie, Aytar, Yusuf, Osindero, Simon, Damen, Dima, Zisserman, Andrew, Carreira, João
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks…
External link:
http://arxiv.org/abs/2305.13786
Author:
Sharma, Mohit, Fantacci, Claudio, Zhou, Yuxiang, Koppula, Skanda, Heess, Nicolas, Scholz, Jon, Aytar, Yusuf
Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robot…
External link:
http://arxiv.org/abs/2304.06600
Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we…
External link:
http://arxiv.org/abs/2303.14885