Výsledky vyhledávání - "Zhai, Xiaohua"

Report

Toward a Diffusion-Based Generalist for Dense Vision Tasks

Autor: Fan, Yue, Xian, Yongqin, Zhai, Xiaohua, Kolesnikov, Alexander, Naeem, Muhammad Ferjad, Schiele, Bernt, Tombari, Federico

Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring

Externí odkaz: http://arxiv.org/abs/2407.00503

Zobrazit plný text záznamu

Report

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Autor: Pouget, Angéline, Beyer, Lucas, Bugliarello, Emanuele, Wang, Xiao, Steiner, Andreas Peter, Zhai, Xiaohua, Alabdulmohsin, Ibrahim

We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training da

Externí odkaz: http://arxiv.org/abs/2405.13777

Zobrazit plný text záznamu

Report

LocCa: Visual Pretraining with Location-aware Captioners

Autor: Wan, Bo, Tschannen, Michael, Xian, Yongqin, Pavetic, Filip, Alabdulmohsin, Ibrahim, Wang, Xiao, Pinto, André Susano, Steiner, Andreas, Beyer, Lucas, Zhai, Xiaohua

Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a

Externí odkaz: http://arxiv.org/abs/2403.19596

Zobrazit plný text záznamu

Report

CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Autor: Alabdulmohsin, Ibrahim, Wang, Xiao, Steiner, Andreas, Goyal, Priya, D'Amour, Alexander, Zhai, Xiaohua

Publikováno v: ICLR 2024

We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal

Externí odkaz: http://arxiv.org/abs/2403.04547

Zobrazit plný text záznamu

Report

SILC: Improving Vision Language Pretraining with Self-Distillation

Autor: Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, Tombari, Federico

Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense pred

Externí odkaz: http://arxiv.org/abs/2310.13355

Zobrazit plný text záznamu

Report

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Autor: Chen, Xi, Wang, Xiao, Beyer, Lucas, Kolesnikov, Alexander, Wu, Jialin, Voigtlaender, Paul, Mustafa, Basil, Goodman, Sebastian, Alabdulmohsin, Ibrahim, Padlewski, Piotr, Salz, Daniel, Xiong, Xi, Vlasic, Daniel, Pavetic, Filip, Rong, Keran, Yu, Tianli, Keysers, Daniel, Zhai, Xiaohua, Soricut, Radu

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrain

Externí odkaz: http://arxiv.org/abs/2310.09199

Zobrazit plný text záznamu

Report

Image Captioners Are Scalable Vision Learners Too

Autor: Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, Beyer, Lucas

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data

Externí odkaz: http://arxiv.org/abs/2306.07915

Zobrazit plný text záznamu

Report

PaLI-X: On Scaling up a Multilingual Vision and Language Model

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-ra

Externí odkaz: http://arxiv.org/abs/2305.18565

Zobrazit plný text záznamu

Report

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Autor: Kossen, Jannik, Collier, Mark, Mustafa, Basil, Wang, Xiao, Zhai, Xiaohua, Beyer, Lucas, Steiner, Andreas, Berent, Jesse, Jenatton, Rodolphe, Kokiopoulou, Efi

We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has rece

Externí odkaz: http://arxiv.org/abs/2305.16999

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání