Showing 1 - 10 of 29 results for search: "Selvaraju, Ramprasaath R."
Author:
Wang, Jun; Gao, Mingfei; Hu, Yuqian; Selvaraju, Ramprasaath R.; Ramaiah, Chetan; Xu, Ran; JaJa, Joseph F.; Davis, Larry S.
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we …
External link:
http://arxiv.org/abs/2208.01813
Despite the rapid progress in deep visual recognition, modern computer vision datasets significantly overrepresent the developed world, and models trained on such datasets underperform on images from unseen geographies. We investigate the effectiveness …
External link:
http://arxiv.org/abs/2204.11122
We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every …
External link:
http://arxiv.org/abs/2112.07133
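The distinguishing claim in this snippet is that a single negative image-text pair per positive can suffice for contrastive alignment. As a rough illustration of that idea only, not the CLIP-Lite implementation, the sketch below pairs each image with its matching caption plus one shuffled caption and scores them with a softplus-based binary objective; the embedding shapes and the roll-based negative sampling are assumptions.

```python
# Hedged sketch: a contrastive objective that uses exactly one negative
# text per image, in the spirit of the "one negative pair per positive"
# claim. Encoders, dimensions, and negative sampling are placeholders,
# not the CLIP-Lite implementation.
import torch
import torch.nn.functional as F

def one_negative_contrastive_loss(img_emb, txt_emb):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of paired data."""
    # Positive scores: matching image-text pairs.
    pos = (img_emb * txt_emb).sum(dim=-1)          # (B,)
    # One negative per image: pair each image with the next text in the batch.
    neg_txt = torch.roll(txt_emb, shifts=1, dims=0)
    neg = (img_emb * neg_txt).sum(dim=-1)          # (B,)
    # Softplus-based binary objective: push positives up, the one negative down.
    return F.softplus(-pos).mean() + F.softplus(neg).mean()

# Toy usage with random features standing in for encoder outputs.
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
print(one_negative_contrastive_loss(img, txt))
```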
Videos are a rich source for self-supervised learning (SSL) of visual representations due to the presence of natural temporal transformations of objects. However, current methods typically randomly sample video clips for learning, which results in an …
External link:
http://arxiv.org/abs/2112.00804
Author:
Li, Junnan; Selvaraju, Ramprasaath R.; Gotmare, Akhilesh Deepak; Joty, Shafiq; Xiong, Caiming; Hoi, Steven
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens …
External link:
http://arxiv.org/abs/2107.07651
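For readers unfamiliar with the phrase "transformer-based multimodal encoder to jointly model visual tokens and word tokens", the minimal sketch below shows the general pattern: both token sequences are concatenated and processed by one self-attention encoder. All module names, dimensions, and the type-embedding detail are illustrative assumptions, not the architecture from this paper.

```python
# Minimal sketch of a transformer encoder that jointly attends over visual
# tokens and word tokens. Dimensions and module names are assumptions.
import torch
import torch.nn as nn

class JointMultimodalEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Learned embeddings marking whether a token is visual or textual.
        self.type_embed = nn.Embedding(2, dim)

    def forward(self, visual_tokens, word_tokens):
        # visual_tokens: (B, Nv, dim), word_tokens: (B, Nt, dim)
        v = visual_tokens + self.type_embed.weight[0]
        t = word_tokens + self.type_embed.weight[1]
        joint = torch.cat([v, t], dim=1)   # (B, Nv + Nt, dim)
        return self.encoder(joint)         # cross-modal self-attention

enc = JointMultimodalEncoder()
out = enc(torch.randn(2, 36, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 48, 256])
```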
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been primarily applied to unlabeled ImageNet images and show marginal gains when trained on …
External link:
http://arxiv.org/abs/2012.04630
Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world: they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions …
External link:
http://arxiv.org/abs/2010.10038
Author:
Selvaraju, Ramprasaath R.; Tendulkar, Purva; Parikh, Devi; Horvitz, Eric; Ribeiro, Marco; Nushi, Besmira; Kamar, Ece
Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions …
External link:
http://arxiv.org/abs/2001.06927
An approach to making text visually appealing and memorable is semantic reinforcement: the use of visual cues alluding to the context or theme in which the word is being used to reinforce the message (e.g., Google Doodles). We present a computational …
External link:
http://arxiv.org/abs/1903.07820
Author:
Selvaraju, Ramprasaath R.; Lee, Stefan; Shen, Yilin; Jin, Hongxia; Ghosh, Shalini; Heck, Larry; Batra, Dhruv; Parikh, Devi
Published in:
The IEEE International Conference on Computer Vision (ICCV) 2019
Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) …
External link:
http://arxiv.org/abs/1902.03751
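This snippet is about making a model's own notion of visual importance agree with where humans attend. As a hedged sketch of that general idea, not the objective from the paper, the code below derives gradient-based importance for image regions and penalizes divergence from a human attention map; the function name, the KL form of the penalty, and the toy scorer are all assumptions.

```python
# Hedged sketch: align a model's gradient-based importance over image
# regions with human attention scores. Names and the KL-based penalty
# are illustrative assumptions, not the method from the paper.
import torch
import torch.nn.functional as F

def grounding_alignment_loss(answer_score, region_feats, human_attn):
    """
    answer_score: scalar model score for the ground-truth answer.
    region_feats: (N, D) region features that require gradients.
    human_attn:   (N,) human attention weights over the same regions.
    """
    # Gradient-based importance of each region for the answer.
    grads = torch.autograd.grad(answer_score, region_feats, create_graph=True)[0]
    importance = grads.norm(dim=-1)   # (N,)
    # Encourage the model's importance distribution to match human attention.
    return F.kl_div(
        torch.log_softmax(importance, dim=0),
        torch.softmax(human_attn, dim=0),
        reduction="sum",
    )

# Toy usage with a trivial scorer standing in for a VQA model.
feats = torch.randn(36, 64, requires_grad=True)
score = feats.sum()   # placeholder answer score
human = torch.rand(36)
print(grounding_alignment_loss(score, feats, human))
```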