Showing 1 - 10 of 16 results for search: '"Li, Zhuowan"'
Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have exhibited strong potential in complicated multi-modal tasks. Empowered by recent advances in large language models…
External link:
http://arxiv.org/abs/2408.02210
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly…
External link:
http://arxiv.org/abs/2407.16833
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions…
External link:
http://arxiv.org/abs/2403.16385
While Multi-modal Language Models (MLMs) demonstrate impressive multimodal ability, they still struggle to provide factual and precise responses for tasks like visual question answering (VQA). In this paper, we address this challenge from the perspective…
External link:
http://arxiv.org/abs/2312.06685
Despite rapid progress in Visual Question Answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks…
External link:
http://arxiv.org/abs/2310.17914
Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of the visual…
External link:
http://arxiv.org/abs/2212.00281
Author:
Li, Zhuowan, Wang, Xingrui, Stengel-Eskin, Elias, Kortylewski, Adam, Ma, Wufei, Van Durme, Benjamin, Yuille, Alan
Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle with domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult…
External link:
http://arxiv.org/abs/2212.00259
Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world-knowledge to varying degrees…
External link:
http://arxiv.org/abs/2205.01850
While Visual Question Answering (VQA) has progressed rapidly, previous works raise concerns about the robustness of current VQA models. In this work, we study the robustness of VQA models from a novel perspective: visual context. We suggest that the models…
External link:
http://arxiv.org/abs/2204.02285
Author:
Li, Zhuowan, Stengel-Eskin, Elias, Zhang, Yixiao, Xie, Cihang, Tran, Quan, Van Durme, Benjamin, Yuille, Alan
While neural symbolic methods demonstrate impressive performance in visual question answering on synthetic images, their performance suffers on real images. We identify that the long-tail distribution of visual concepts and unequal importance of reasoning…
External link:
http://arxiv.org/abs/2110.00519