Showing 1 - 9 of 9 for search: '"Kil, Jihyung"'
Author:
Kil, Jihyung, Mai, Zheda, Lee, Justin, Wang, Zihe, Cheng, Kerrie, Wang, Lemeng, Liu, Ye, Chowdhury, Arpita, Chao, Wei-Lun
The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa…
External link:
http://arxiv.org/abs/2407.16837
Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning…
External link:
http://arxiv.org/abs/2407.00087
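As background to the RLHF/RLAIF alignment mentioned in this abstract, the sketch below shows the standard pairwise (Bradley-Terry) reward-modeling loss commonly used in such pipelines. It is a generic illustration under that assumption, not the method proposed in the paper.

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-modeling loss used in
# typical RLHF/RLAIF pipelines; illustrative only, not this paper's method.
import torch
import torch.nn.functional as F

def reward_modeling_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: reward-model scores for the preferred and
    dispreferred response in each feedback pair, shape (batch,)."""
    # Maximize the margin by which the preferred response outscores the other.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scores for three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_modeling_loss(chosen, rejected))  # scalar training loss
```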
Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model's overall accuracy without evaluating it on different reasoning…
External link:
http://arxiv.org/abs/2402.11058
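The abstract's point is that overall accuracy hides how a model fares on different reasoning cases. A minimal sketch of the kind of per-category breakdown this implies is below; the field names and category labels are hypothetical, not the paper's actual taxonomy.

```python
# Sketch of a per-reasoning-category accuracy breakdown (field and category
# names are hypothetical; the paper's taxonomy may differ).
from collections import defaultdict

def accuracy_by_category(predictions):
    """predictions: iterable of dicts with 'category', 'pred', 'answer'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in predictions:
        total[ex["category"]] += 1
        correct[ex["category"]] += int(ex["pred"] == ex["answer"])
    return {cat: correct[cat] / total[cat] for cat in total}

preds = [
    {"category": "multi-hop", "pred": "red", "answer": "red"},
    {"category": "multi-hop", "pred": "two", "answer": "three"},
    {"category": "single-hop", "pred": "dog", "answer": "dog"},
]
print(accuracy_by_category(preds))  # {'multi-hop': 0.5, 'single-hop': 1.0}
```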
Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces…
External link:
http://arxiv.org/abs/2402.04476
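To make concrete what "HTML documents define the action space" means, here is a toy sketch that collects the interactive elements an agent could click or type into. It is a generic illustration using BeautifulSoup, not the pipeline proposed in the paper.

```python
# Toy illustration of how an HTML document induces a web agent's action space:
# enumerate actionable elements as candidate actions. Generic sketch only.
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/cart">View cart</a>
  <button id="checkout">Checkout</button>
  <input name="search" placeholder="Search products">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# One candidate action per interactive element.
actions = [
    {"tag": el.name, "text": el.get_text(strip=True), "attrs": dict(el.attrs)}
    for el in soup.find_all(["a", "button", "input", "select", "textarea"])
]
for action in actions:
    print(action)
```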
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering…
External link:
http://arxiv.org/abs/2401.01614
Author:
Kil, Jihyung, Changpinyo, Soravit, Chen, Xi, Hu, Hexiang, Goodman, Sebastian, Chao, Wei-Lun, Soricut, Radu
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective…
External link:
http://arxiv.org/abs/2209.05534
Published in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15482-15491
We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons…
External link:
http://arxiv.org/abs/2202.07028
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples -- there are too many questions one can ask about an image…
External link:
http://arxiv.org/abs/2109.06122
Author:
Kil, Jihyung, Chao, Wei-Lun
Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations…
External link:
http://arxiv.org/abs/2104.10355
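The abstract describes the classic zero-shot recipe of matching visual features against per-class semantic representations. A minimal sketch of that scoring step follows; the cosine-similarity scoring, shapes, and random features are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch of semantic-representation-based zero-shot classification:
# score an image embedding against one document-derived embedding per unseen
# class and pick the best match. Illustrative assumptions throughout.
import numpy as np

def zero_shot_predict(image_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """image_emb: (d,) visual feature; class_embs: (C, d) per-class document
    embeddings. Returns the index of the predicted unseen class."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return int(np.argmax(class_embs @ image_emb))  # cosine similarity

rng = np.random.default_rng(0)
img = rng.normal(size=64)        # hypothetical image feature
docs = rng.normal(size=(5, 64))  # hypothetical embeddings of 5 class documents
print(zero_shot_predict(img, docs))
```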