Showing 1 - 10 of 157 for search: '"Gong, Boqing"'
We introduce SOAR, a novel Self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and …
External link:
http://arxiv.org/abs/2409.18300
Published in:
Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently …
External link:
http://arxiv.org/abs/2407.01606
The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negat…
External link:
http://arxiv.org/abs/2406.02965
Diffusion models have achieved remarkable success in text-to-image generation tasks; however, the role of initial noise has been rarely explored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that …
External link:
http://arxiv.org/abs/2406.01970
Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there a…
External link:
http://arxiv.org/abs/2405.16567
Author:
Zhang, Zheyuan, Keles, Elif, Durak, Gorkem, Taktak, Yavuz, Susladkar, Onkar, Gorade, Vandan, Jha, Debesh, Ormeci, Asli C., Medetalibeyoglu, Alpay, Yao, Lanhong, Wang, Bin, Isler, Ilkin Sevgi, Peng, Linkai, Pan, Hongyi, Vendrami, Camila Lopes, Bourhani, Amir, Velichko, Yury, Gong, Boqing, Spampinato, Concetto, Pyrros, Ayis, Tiwari, Pallavi, Klatte, Derk C. F., Engels, Megan, Hoogenboom, Sanne, Bolan, Candice W., Agarunov, Emil, Harfouch, Nassier, Huang, Chenchan, Bruno, Marco J., Schoots, Ivo, Keswani, Rajesh N., Miller, Frank H., Gonda, Tamas, Yazici, Cemal, Tirkes, Temel, Turkbey, Baris, Wallace, Michael B., Bagci, Ulas
Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, la…
External link:
http://arxiv.org/abs/2405.12367
Author:
Zhao, Long, Gundavarapu, Nitesh B., Yuan, Liangzhe, Zhou, Hao, Yan, Shen, Sun, Jennifer J., Friedman, Luke, Qian, Rui, Weyand, Tobias, Zhao, Yue, Hornung, Rachel, Schroff, Florian, Yang, Ming-Hsuan, Ross, David A., Wang, Huisheng, Adam, Hartwig, Sirotenko, Mikhail, Liu, Ting, Gong, Boqing
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips …
External link:
http://arxiv.org/abs/2402.13217
Author:
Zhao, Yue, Zhao, Long, Zhou, Xingyi, Wu, Jialin, Chu, Chun-Te, Miao, Hui, Schroff, Florian, Adam, Hartwig, Liu, Ting, Gong, Boqing, Krähenbühl, Philipp, Yuan, Liangzhe
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort …
External link:
http://arxiv.org/abs/2401.06129
Author:
Hu, Hexiang, Chan, Kelvin C. K., Su, Yu-Chuan, Chen, Wenhu, Li, Yandong, Sohn, Kihyuk, Zhao, Yang, Ben, Xue, Gong, Boqing, Cohen, William, Chang, Ming-Wei, Jia, Xuhui
This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation …
External link:
http://arxiv.org/abs/2401.01952
Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably mu…
External link:
http://arxiv.org/abs/2311.06386