Showing 1 - 10 of 65 for search: '"Gokhale, Tejas A."'
Author:
Patel, Maitreya, Kusumba, Abhiram, Cheng, Sheng, Kim, Changhoon, Gokhale, Tejas, Baral, Chitta, Yang, Yezhou
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream t
External link:
http://arxiv.org/abs/2411.02545
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reaso
External link:
http://arxiv.org/abs/2408.02231
Domain Generalization (DG) is a challenging task in machine learning that requires a coherent ability to comprehend shifts across various domains through extraction of domain-invariant features. DG performance is typically evaluated by performing ima
External link:
http://arxiv.org/abs/2405.15961
Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, rem
External link:
http://arxiv.org/abs/2404.08540
Author:
Saha, Sourajit, Gokhale, Tejas
Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework
External link:
http://arxiv.org/abs/2404.07410
Author:
Chatterjee, Agneet, Stan, Gabriela Ben Melech, Aflalo, Estelle, Paul, Sayak, Ghosh, Dhruba, Gokhale, Tejas, Schmidt, Ludwig, Hajishirzi, Hannaneh, Lal, Vasudev, Baral, Chitta, Yang, Yezhou
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation
External link:
http://arxiv.org/abs/2404.01197
Generalizing to unseen image domains is a challenging problem primarily due to the lack of diverse training data, inaccessible target data, and the large domain shift that may exist in many real-world settings. As such, data augmentation is a critical
External link:
http://arxiv.org/abs/2307.09520
The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high definition and realistic image quality generation by
External link:
http://arxiv.org/abs/2306.04695
We investigate knowledge retrieval with multi-modal queries, i.e., queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for
External link:
http://arxiv.org/abs/2306.00424
In this work, we present a data poisoning attack that confounds machine learning models without any manipulation of the image or label. This is achieved by simply leveraging the most confounding natural samples found within the training data itself,
External link:
http://arxiv.org/abs/2303.17080