Showing 1 - 10 of 84 for search: '"Beyer, Lucas"'
Author:
Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, Kolesnikov, Alexander, Wang, Xiao, Salz, Daniel, Neumann, Maxim, Alabdulmohsin, Ibrahim, Tschannen, Michael, Bugliarello, Emanuele, Unterthiner, Thomas, Keysers, Daniel, Koppula, Skanda, Liu, Fangyu, Grycner, Adam, Gritsenko, Alexey, Houlsby, Neil, Kumar, Manoj, Rong, Keran, Eisenschlos, Julian, Kabra, Rishabh, Bauer, Matthias, Bošnjak, Matko, Chen, Xi, Minderer, Matthias, Voigtlaender, Paul, Bica, Ioana, Balazevic, Ivana, Puigcerver, Joan, Papalampidi, Pinelopi, Henaff, Olivier, Xiong, Xi, Soricut, Radu, Harmsen, Jeremiah, Zhai, Xiaohua
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong…
External link:
http://arxiv.org/abs/2407.07726
Author:
Pouget, Angéline, Beyer, Lucas, Bugliarello, Emanuele, Wang, Xiao, Steiner, Andreas Peter, Zhai, Xiaohua, Alabdulmohsin, Ibrahim
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data…
External link:
http://arxiv.org/abs/2405.13777
Author:
Wan, Bo, Tschannen, Michael, Xian, Yongqin, Pavetic, Filip, Alabdulmohsin, Ibrahim, Wang, Xiao, Pinto, André Susano, Steiner, Andreas, Beyer, Lucas, Zhai, Xiaohua
Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a…
External link:
http://arxiv.org/abs/2403.19596
Author:
Chen, Xi, Wang, Xiao, Beyer, Lucas, Kolesnikov, Alexander, Wu, Jialin, Voigtlaender, Paul, Mustafa, Basil, Goodman, Sebastian, Alabdulmohsin, Ibrahim, Padlewski, Piotr, Salz, Daniel, Xiong, Xi, Vlasic, Daniel, Pavetic, Filip, Rong, Keran, Yu, Tianli, Keysers, Daniel, Zhai, Xiaohua, Soricut, Radu
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained…
External link:
http://arxiv.org/abs/2310.09199
Author:
Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, Beyer, Lucas
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data…
External link:
http://arxiv.org/abs/2306.07915
Author:
Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-ranging…
External link:
http://arxiv.org/abs/2305.18565
Author:
Kossen, Jannik, Collier, Mark, Mustafa, Basil, Wang, Xiao, Zhai, Xiaohua, Beyer, Lucas, Steiner, Andreas, Berent, Jesse, Jenatton, Rodolphe, Kokiopoulou, Efi
We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently…
External link:
http://arxiv.org/abs/2305.16999
Published in:
37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement…
External link:
http://arxiv.org/abs/2305.13035
Author:
Beyer, Lucas, Wan, Bo, Madan, Gagan, Pavetic, Filip, Steiner, Andreas, Kolesnikov, Alexander, Pinto, André Susano, Bugliarello, Emanuele, Wang, Xiao, Yu, Qihang, Chen, Liang-Chieh, Zhai, Xiaohua
There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and…
External link:
http://arxiv.org/abs/2303.17376
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise…
External link:
http://arxiv.org/abs/2303.15343
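The last entry describes SigLIP's pairwise sigmoid loss, which scores each image-text pair as an independent binary match/no-match classification instead of normalizing over the whole batch with a softmax. A minimal sketch of that idea (not the authors' implementation; the temperature `t` and bias `b` values below are illustrative placeholders, and `siglip_loss` is a hypothetical helper name):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss sketch: every image-text pair in a batch of n
    is an independent binary classification (diagonal = match, off-diagonal
    = non-match), so no batch-wide softmax normalization is required."""
    # L2-normalize the embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b          # (n, n) scaled cosine similarities
    n = img.shape[0]
    z = 2.0 * np.eye(n) - 1.0             # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(z * logits), computed stably as log(1 + exp(-z * logits))
    return np.sum(np.logaddexp(0.0, -z * logits)) / n
```

Because the loss over an n x n batch decomposes into n^2 independent terms, no process needs a global view of all pairwise similarities, which is what lets this formulation scale to large batches.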