Showing 1 - 10 of 95 for search: '"Adam Hartwig"'
Author:
Zhao, Long, Woo, Sanghyun, Wan, Ziyu, Li, Yandong, Zhang, Han, Gong, Boqing, Adam, Hartwig, Jia, Xuhui, Liu, Ting
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. (See the sketch below.)
External link:
http://arxiv.org/abs/2410.04081
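The abstract describes tokenization only in general terms. As one concrete illustration (not this paper's method), the sketch below quantizes patch embeddings against a codebook, vector-quantization style; the codebook, its size, and the patch grid are all invented for the example.

    import numpy as np

    # Hypothetical sizes; nothing here is taken from the paper.
    num_codes, code_dim = 512, 16
    codebook = np.random.randn(num_codes, code_dim)   # stands in for a learned codebook

    def tokenize(patches):
        """Map each patch embedding to the index of its nearest codebook vector."""
        # Squared Euclidean distance from every patch to every code: (N, num_codes).
        d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    patches = np.random.randn(64, code_dim)   # e.g. an 8x8 grid of patch embeddings
    tokens = tokenize(patches)                # 64 discrete ids instead of 64*16 floats
    print(tokens.shape, tokens.dtype)         # (64,) int64

The compression is the point: a short sequence of integer tokens replaces the raw high-dimensional values.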
Author:
Zhao, Long, Gundavarapu, Nitesh B., Yuan, Liangzhe, Zhou, Hao, Yan, Shen, Sun, Jennifer J., Friedman, Luke, Qian, Rui, Weyand, Tobias, Zhao, Yue, Hornung, Rachel, Schroff, Florian, Yang, Ming-Hsuan, Ross, David A., Wang, Huisheng, Adam, Hartwig, Sirotenko, Mikhail, Liu, Ting, Gong, Boqing
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips … (See the sketch below.)
External link:
http://arxiv.org/abs/2402.13217
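The "single frozen model" setup can be pictured as feature probing: the encoder weights never change, and only a small per-task head is trained. The backbone below is a random MLP standing in for the real encoder; all dimensions and the 400-class head are assumptions.

    import torch
    from torch import nn

    # Placeholder for a pretrained video encoder; a random MLP plays its role here.
    backbone = nn.Sequential(nn.Linear(2048, 768), nn.GELU(), nn.Linear(768, 768))
    for p in backbone.parameters():
        p.requires_grad_(False)            # the encoder stays frozen across all tasks

    head = nn.Linear(768, 400)             # per-task head, e.g. 400 action classes
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

    clips = torch.randn(8, 2048)           # placeholder clip features
    labels = torch.randint(0, 400, (8,))
    with torch.no_grad():                  # no gradients flow into the frozen encoder
        emb = backbone(clips)
    loss = nn.functional.cross_entropy(head(emb), labels)
    loss.backward()
    optimizer.step()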
Author:
Zhao, Yue, Zhao, Long, Zhou, Xingyi, Wu, Jialin, Chu, Chun-Te, Miao, Hui, Schroff, Florian, Adam, Hartwig, Liu, Ting, Gong, Boqing, Krähenbühl, Philipp, Yuan, Liangzhe
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort …
External link:
http://arxiv.org/abs/2401.06129
Author:
Kondratyuk, Dan, Yu, Lijun, Gu, Xiuye, Lezama, José, Huang, Jonathan, Schindler, Grant, Hornung, Rachel, Birodkar, Vighnesh, Yan, Jimmy, Chiu, Ming-Chang, Somandepalli, Krishna, Akbari, Hassan, Alon, Yair, Cheng, Yong, Dillon, Josh, Gupta, Agrim, Hahn, Meera, Hauth, Anja, Hendon, David, Martinez, Alonso, Minnen, David, Sirotenko, Mikhail, Sohn, Kihyuk, Yang, Xuan, Adam, Hartwig, Yang, Ming-Hsuan, Essa, Irfan, Wang, Huisheng, Ross, David A., Seybold, Bryan, Jiang, Lu
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including … (See the sketch below.)
External link:
http://arxiv.org/abs/2312.14125
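One common way to drive a decoder-only language model with several modalities, which may help picture the architecture the abstract names, is to give each modality a disjoint token-id range and concatenate the streams into one sequence. The vocabulary sizes are invented, and PyTorch's TransformerEncoder with a causal mask merely stands in for a real decoder stack.

    import torch
    from torch import nn

    # Assumed vocabulary layout: disjoint id ranges per modality, one shared embedding.
    TEXT, VIDEO, AUDIO = 10_000, 8_192, 4_096     # made-up per-modality vocab sizes
    vocab = TEXT + VIDEO + AUDIO

    embed = nn.Embedding(vocab, 512)
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    decoder = nn.TransformerEncoder(layer, num_layers=2)
    lm_head = nn.Linear(512, vocab)

    def to_sequence(text_ids, video_ids, audio_ids):
        # Shift each modality into its own id range, then form one token stream.
        return torch.cat([text_ids, video_ids + TEXT, audio_ids + TEXT + VIDEO])

    seq = to_sequence(torch.tensor([1, 2]), torch.tensor([5]), torch.tensor([7, 7]))[None]
    T = seq.shape[1]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # no peeking ahead
    logits = lm_head(decoder(embed(seq), mask=causal))
    print(logits.shape)    # (1, T, vocab): next-token logits over all modalities at once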
Author:
Yang, Xuan, Yuan, Liangzhe, Wilber, Kimberly, Sharma, Astuti, Gu, Xiuye, Qiao, Siyuan, Debats, Stephanie, Wang, Huisheng, Adam, Hartwig, Sirotenko, Mikhail, Chen, Liang-Chieh
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has … (See the sketch below.)
External link:
http://arxiv.org/abs/2311.05770
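The two formulations the abstract contrasts differ only in the output layer: discrete logits per pixel versus a continuous channel per pixel. A minimal sketch, with the feature-map shape and class/channel counts assumed:

    import torch
    from torch import nn

    features = torch.randn(1, 256, 64, 64)     # backbone feature map (assumed shape)

    # Per-pixel classification: a 1x1 conv yields one logit per class at every pixel.
    seg_head = nn.Conv2d(256, 21, kernel_size=1)      # e.g. 21 semantic classes
    seg = seg_head(features).argmax(dim=1)            # (1, 64, 64) discrete labels

    # Per-pixel regression: the same layout, but a single continuous output channel.
    depth_head = nn.Conv2d(256, 1, kernel_size=1)
    depth = depth_head(features).squeeze(1)           # (1, 64, 64) continuous depths
    print(seg.shape, depth.shape)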
Author:
Waghmare, Sagar M., Wilber, Kimberly, Hawkey, Dave, Yang, Xuan, Wilson, Matthew, Debats, Stephanie, Nuengsigkapian, Cattalyya, Sharma, Astuti, Pandikow, Lars, Wang, Huisheng, Adam, Hartwig, Sirotenko, Mikhail
We introduce SANPO, a large-scale egocentric video dataset focused on dense prediction in outdoor environments. It contains stereo video sessions collected across diverse outdoor environments, as well as rendered synthetic video sessions. …
External link:
http://arxiv.org/abs/2309.12172
Author:
Yuan, Liangzhe, Gundavarapu, Nitesh Bharadwaj, Zhao, Long, Zhou, Hao, Cui, Yin, Jiang, Lu, Yang, Xuan, Jia, Menglin, Weyand, Tobias, Friedman, Luke, Sirotenko, Mikhail, Wang, Huisheng, Schroff, Florian, Adam, Hartwig, Yang, Ming-Hsuan, Liu, Ting, Gong, Boqing
We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight …
External link:
http://arxiv.org/abs/2307.03166
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. (See the sketch below.)
External link:
http://arxiv.org/abs/2305.06324
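A minimal reading of "a single Transformer encoder" for several modalities is per-modality input projections feeding one set of shared encoder weights. Everything below (dimensions, modules, modality list) is an assumption for illustration, not IMP's actual design.

    import torch
    from torch import nn

    d = 256
    # One lightweight projection per modality; the encoder itself is shared.
    proj = nn.ModuleDict({
        "image": nn.Linear(768, d),
        "text":  nn.Linear(512, d),
        "audio": nn.Linear(128, d),
    })
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    shared_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def encode(modality, x):
        # All modalities pass through the same encoder weights.
        return shared_encoder(proj[modality](x))

    img = encode("image", torch.randn(1, 16, 768))
    txt = encode("text",  torch.randn(1, 8, 512))
    print(img.shape, txt.shape)    # both (batch, tokens, 256) in the shared space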
Author:
Zhao, Long, Yuan, Liangzhe, Gong, Boqing, Cui, Yin, Schroff, Florian, Yang, Ming-Hsuan, Adam, Hartwig, Liu, Ting
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated … (See the sketch below.)
External link:
http://arxiv.org/abs/2303.08998
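The bookkeeping behind a union of label spaces can be shown with a toy example: canonicalize aliased names, build the unified vocabulary, and keep per-dataset masks so each dataset only supervises the labels it actually annotates. The datasets, labels, and alias table below are invented; this is not the paper's algorithm.

    # Two toy taxonomies that disagree on naming ("human" vs "person").
    datasets = {
        "A": ["person", "car", "dog"],
        "B": ["human", "car", "cat"],
    }
    alias = {"human": "person"}     # hand-written canonicalization (assumed)

    unified = sorted({alias.get(l, l) for labels in datasets.values() for l in labels})
    index = {lbl: i for i, lbl in enumerate(unified)}

    # Per-dataset mask: absence of a label in a dataset is not evidence of a negative,
    # so losses should only cover the labels that dataset annotates.
    masks = {
        name: [lbl in {alias.get(l, l) for l in labels} for lbl in unified]
        for name, labels in datasets.items()
    }
    print(unified)       # ['car', 'cat', 'dog', 'person']
    print(masks["A"])    # [True, False, True, True]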
Author:
Ge, Yunhao, Ren, Jie, Gallagher, Andrew, Wang, Yuxiao, Yang, Ming-Hsuan, Adam, Hartwig, Itti, Laurent, Lakshminarayanan, Balaji, Zhao, Jiaping
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are … (See the sketch below.)
External link:
http://arxiv.org/abs/2212.01758
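Zero-shot classification with an image-text model reduces to nearest-class-prompt search in a shared embedding space; top-5 accuracy only asks that the true class land among the five best scores. The random vectors below stand in for real CLIP/LiT image and text embeddings.

    import numpy as np

    rng = np.random.default_rng(0)
    class_names = [f"class_{i}" for i in range(1000)]
    text_emb = rng.standard_normal((1000, 512))   # stand-in for embedded class prompts
    image_emb = rng.standard_normal(512)          # stand-in for one embedded image

    # Cosine similarity between the image and every class prompt.
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb /= np.linalg.norm(image_emb)
    scores = text_emb @ image_emb

    top5 = np.argsort(scores)[-5:][::-1]   # five most similar classes, best first
    top1 = top5[0]                         # the single strict prediction
    print([class_names[i] for i in top5], class_names[top1])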