Showing 1 - 10 of 24 for search: '"Birodkar, Vighnesh"'
Author:
Birodkar, Vighnesh, Barcik, Gabriel, Lyon, James, Ioffe, Sergey, Minnen, David, Dillon, Joshua V.
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled… (a hedged sketch of such a composite loss follows this entry)
External link:
http://arxiv.org/abs/2409.02529
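The entry above mentions improving autoencoder reconstructions with adversarial (GAN) and perceptual penalties. Below is a minimal, generic sketch of how such penalties are typically combined into one objective; the weighting names, the stand-in discriminator score, and the feature tensors are illustrative assumptions, not the paper's actual formulation.

```python
# Generic composite autoencoder objective: pixel reconstruction + GAN term +
# perceptual term. All names and weights below are illustrative assumptions.
import numpy as np

def composite_loss(x, x_hat, disc_score_fake, feats, feats_hat,
                   lambda_adv=0.1, lambda_perc=1.0):
    """Total loss = pixel MSE + non-saturating GAN loss + feature-space MSE."""
    recon = np.mean((x - x_hat) ** 2)                # pixel-space reconstruction
    adv = -np.mean(np.log(disc_score_fake + 1e-8))   # encourage fooling the discriminator
    perc = np.mean((feats - feats_hat) ** 2)         # distance in a feature space
    return recon + lambda_adv * adv + lambda_perc * perc

# Toy usage with random arrays standing in for images, features, and scores.
rng = np.random.default_rng(0)
x, x_hat = rng.random((4, 32, 32, 3)), rng.random((4, 32, 32, 3))
feats, feats_hat = rng.random((4, 128)), rng.random((4, 128))
disc = rng.random(4)
print(composite_loss(x, x_hat, disc, feats, feats_hat))
```

The relative weights trade off pixel-faithful (often blurry) reconstructions against sharper outputs driven by the discriminator and the feature-space term.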
Author:
Kondratyuk, Dan, Yu, Lijun, Gu, Xiuye, Lezama, José, Huang, Jonathan, Schindler, Grant, Hornung, Rachel, Birodkar, Vighnesh, Yan, Jimmy, Chiu, Ming-Chang, Somandepalli, Krishna, Akbari, Hassan, Alon, Yair, Cheng, Yong, Dillon, Josh, Gupta, Agrim, Hahn, Meera, Hauth, Anja, Hendon, David, Martinez, Alonso, Minnen, David, Sirotenko, Mikhail, Sohn, Kihyuk, Yang, Xuan, Adam, Hartwig, Yang, Ming-Hsuan, Essa, Irfan, Wang, Huisheng, Ross, David A., Seybold, Bryan, Jiang, Lu
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including… (a hedged sketch of a multimodal token layout follows this entry)
External link:
http://arxiv.org/abs/2312.14125
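The VideoPoet entry describes a decoder-only transformer over multimodal inputs. As a rough illustration of the general pattern only, the sketch below concatenates per-modality token streams into one sequence for causal modeling; the special-token ids and layout are invented for the example and are not VideoPoet's actual vocabulary or format.

```python
# Generic idea: tokenize each modality separately, then concatenate the
# streams into a single sequence a causal decoder models left to right.
import numpy as np

BOS, SEP = 0, 1  # hypothetical special tokens, not VideoPoet's vocabulary

def build_sequence(text_tokens, image_tokens, audio_tokens):
    """Concatenate per-modality token ids into one decoder input sequence."""
    return np.concatenate([[BOS], text_tokens, [SEP], image_tokens, [SEP], audio_tokens])

seq = build_sequence(np.arange(10, 15), np.arange(100, 110), np.arange(500, 508))
print(seq.shape)  # one flat token sequence for the decoder
```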
Author:
Yu, Lijun, Lezama, José, Gundavarapu, Nitesh B., Versari, Luca, Sohn, Kihyuk, Minnen, David, Cheng, Yong, Birodkar, Vighnesh, Gupta, Agrim, Gu, Xiuye, Hauptmann, Alexander G., Gong, Boqing, Yang, Ming-Hsuan, Essa, Irfan, Ross, David A., Jiang, Lu
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual… (a hedged vector-quantization sketch follows this entry)
External link:
http://arxiv.org/abs/2310.05737
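The entry above points at the visual tokenizer as the crucial component for using LLMs on visual data. The sketch below shows the generic vector-quantization step such a tokenizer performs (nearest-neighbour lookup in a codebook); the codebook size and embedding width are arbitrary assumptions, not the paper's design.

```python
# Generic vector quantization: map continuous patch embeddings to discrete
# token ids by nearest-neighbour lookup in a learned codebook.
import numpy as np

def tokenize(patch_embeddings, codebook):
    """Return, for each patch embedding, the index of the closest code vector."""
    # squared distances, shape (num_patches, codebook_size)
    d = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))   # 1024 discrete visual tokens (assumed size)
patches = rng.normal(size=(64, 8))      # 64 patch embeddings from one image
ids = tokenize(patches, codebook)
print(ids[:10])                         # discrete ids a language model can predict
```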
Author:
Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, Jenatton, Rodolphe, Beyer, Lucas, Tschannen, Michael, Arnab, Anurag, Wang, Xiao, Riquelme, Carlos, Minderer, Matthias, Puigcerver, Joan, Evci, Utku, Kumar, Manoj, van Steenkiste, Sjoerd, Elsayed, Gamaleldin F., Mahendran, Aravindh, Yu, Fisher, Oliver, Avital, Huot, Fantine, Bastings, Jasmijn, Collier, Mark Patrick, Gritsenko, Alexey, Birodkar, Vighnesh, Vasconcelos, Cristina, Tay, Yi, Mensink, Thomas, Kolesnikov, Alexander, Pavetić, Filip, Tran, Dustin, Kipf, Thomas, Lučić, Mario, Zhai, Xiaohua, Keysers, Daniel, Harmsen, Jeremiah, Houlsby, Neil
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and…
External link:
http://arxiv.org/abs/2302.05442
Author:
Rathod, Vivek, Seybold, Bryan, Vijayanarasimhan, Sudheendra, Myers, Austin, Gu, Xiuye, Birodkar, Vighnesh, Ross, David A.
Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained… (a hedged co-embedding scoring sketch follows this entry)
External link:
http://arxiv.org/abs/2212.10596
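The entry above relies on pretrained image-text co-embeddings for open-vocabulary temporal action detection. The sketch below shows the core scoring idea under that setup: frames and free-form class names share an embedding space, so a per-frame class score is a cosine similarity. The random embeddings stand in for a real pretrained encoder (a CLIP-style model is assumed here, not necessarily the paper's exact setup).

```python
# Open-vocabulary scoring with a shared image-text embedding space:
# score(frame, class name) = cosine similarity of their embeddings.
import numpy as np

def cosine_scores(frame_embs, text_embs):
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return f @ t.T  # shape (num_frames, num_classes)

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 512))   # stand-in embeddings for 300 video frames
queries = rng.normal(size=(3, 512))    # stand-ins for arbitrary text class names
scores = cosine_scores(frames, queries)
print(scores.argmax(axis=1)[:10])      # best-matching class per frame
```

Thresholding or grouping these per-frame scores over time is what turns them into temporal detections; that step is not shown here.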
Published in:
CVPR 2022
A common practice in transfer learning is to initialize the downstream model weights by pre-training on a data-abundant upstream task. In object detection specifically, the feature backbone is typically initialized with ImageNet classifier weights and… (a hedged warm-start sketch follows this entry)
External link:
http://arxiv.org/abs/2204.00484
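The entry above describes the common warm-start recipe for detection: backbone weights come from an upstream classifier, while the task head is trained from scratch. The sketch below illustrates that initialization pattern with toy parameter dictionaries; the parameter names and shapes are illustrative assumptions, not the paper's model.

```python
# Warm-start pattern: copy pretrained backbone weights into the detector,
# randomly initialize the detection-specific head.
import numpy as np

rng = np.random.default_rng(0)
pretrained = {"backbone/conv1": rng.normal(size=(64, 3, 7, 7)),
              "classifier/fc": rng.normal(size=(1000, 2048))}   # discarded downstream

detector = {"backbone/conv1": np.zeros((64, 3, 7, 7)),
            "head/box_regressor": np.zeros((4, 2048))}

for name in detector:
    if name in pretrained:                       # backbone: warm-start from upstream task
        detector[name] = pretrained[name].copy()
    else:                                        # task-specific head: fresh random init
        detector[name] = rng.normal(scale=0.01, size=detector[name].shape)

print({k: float(v.std()) for k, v in detector.items()})
```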
Author:
Wang, Su, Montgomery, Ceslee, Orbay, Jordi, Birodkar, Vighnesh, Faust, Aleksandra, Gur, Izzeddin, Jaques, Natasha, Waters, Austin, Baldridge, Jason, Anderson, Peter
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system…
External link:
http://arxiv.org/abs/2111.12872
Camera traps enable the automatic collection of large quantities of image data. Ecologists use camera traps to monitor animal populations all over the world. In order to estimate the abundance of a species from camera trap data, ecologists need to know…
External link:
http://arxiv.org/abs/2105.03494
Instance segmentation models today are very accurate when trained on large annotated datasets, but collecting mask annotations at scale is prohibitively expensive. We address the partially supervised instance segmentation problem, in which one can train… (a hedged partial-supervision loss sketch follows this entry)
External link:
http://arxiv.org/abs/2104.00613
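The entry above describes the partially supervised setting, where only a subset of classes carries mask labels while the rest have boxes only. One common way to express that during training, sketched below, is to apply the mask loss only to instances that actually have a ground-truth mask; the names and shapes are illustrative, not the paper's model.

```python
# Apply the per-instance mask loss only where mask supervision exists.
import numpy as np

def masked_mask_loss(pred_masks, gt_masks, has_mask):
    """Mean per-pixel squared error over instances that carry mask labels."""
    per_inst = ((pred_masks - gt_masks) ** 2).mean(axis=(1, 2))
    has_mask = has_mask.astype(float)
    return (per_inst * has_mask).sum() / max(has_mask.sum(), 1.0)

rng = np.random.default_rng(0)
pred = rng.random((5, 28, 28))          # predicted masks for 5 instances
gt = rng.random((5, 28, 28))            # ground-truth masks (zeros where absent)
annotated = np.array([1, 0, 1, 0, 0])   # only some classes have mask labels
print(masked_mask_loss(pred, gt, annotated))
```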
In modern computer vision, convolutional neural networks (CNNs) are indispensable for image classification tasks due to their efficiency and effectiveness. Part of their superiority compared to other architectures comes from the fact that a single…
External link:
http://arxiv.org/abs/1906.03808