Showing 1 - 9 of 9 for search: '"Kil, Jihyung"'
Author:
Kil, Jihyung, Mai, Zheda, Lee, Justin, Wang, Zihe, Cheng, Kerrie, Wang, Lemeng, Liu, Ye, Chowdhury, Arpita, Chao, Wei-Lun
The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa…
External link:
http://arxiv.org/abs/2407.16837
Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning…
External link:
http://arxiv.org/abs/2407.00087
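As background to the RLHF/RLAIF alignment mentioned in this abstract, the sketch below shows the standard pairwise (Bradley-Terry) reward-modeling loss commonly used in such pipelines. It is a generic illustration under that assumption, not the method proposed in the paper.

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-modeling loss used in
# typical RLHF/RLAIF pipelines; illustrative only, not this paper's method.
import torch
import torch.nn.functional as F

def reward_modeling_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: reward-model scores for the preferred and
    dispreferred response in each feedback pair, shape (batch,)."""
    # Maximize the margin by which the preferred response outscores the other.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scores for three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_modeling_loss(chosen, rejected))  # scalar training loss
```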
Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model's overall accuracy without evaluating it on different reasoning…
External link:
http://arxiv.org/abs/2402.11058
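The abstract's point is that overall accuracy hides how a model fares on different reasoning cases. A minimal sketch of the kind of per-category breakdown this implies is below; the field names and category labels are hypothetical, not the paper's actual taxonomy.

```python
# Sketch of a per-reasoning-category accuracy breakdown (field and category
# names are hypothetical; the paper's taxonomy may differ).
from collections import defaultdict

def accuracy_by_category(predictions):
    """predictions: iterable of dicts with 'category', 'pred', 'answer'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in predictions:
        total[ex["category"]] += 1
        correct[ex["category"]] += int(ex["pred"] == ex["answer"])
    return {cat: correct[cat] / total[cat] for cat in total}

preds = [
    {"category": "multi-hop", "pred": "red", "answer": "red"},
    {"category": "multi-hop", "pred": "two", "answer": "three"},
    {"category": "single-hop", "pred": "dog", "answer": "dog"},
]
print(accuracy_by_category(preds))  # {'multi-hop': 0.5, 'single-hop': 1.0}
```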
Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces…
External link:
http://arxiv.org/abs/2402.04476
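To make concrete what "HTML documents define the action space" means, here is a toy sketch that collects the interactive elements an agent could click or type into. It is a generic illustration using BeautifulSoup, not the pipeline proposed in the paper.

```python
# Toy illustration of how an HTML document induces a web agent's action space:
# enumerate actionable elements as candidate actions. Generic sketch only.
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/cart">View cart</a>
  <button id="checkout">Checkout</button>
  <input name="search" placeholder="Search products">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# One candidate action per interactive element.
actions = [
    {"tag": el.name, "text": el.get_text(strip=True), "attrs": dict(el.attrs)}
    for el in soup.find_all(["a", "button", "input", "select", "textarea"])
]
for action in actions:
    print(action)
```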
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering…
External link:
http://arxiv.org/abs/2401.01614
Author:
Kil, Jihyung, Changpinyo, Soravit, Chen, Xi, Hu, Hexiang, Goodman, Sebastian, Chao, Wei-Lun, Soricut, Radu
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective…
External link:
http://arxiv.org/abs/2209.05534
Published in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15482-15491
We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons…
External link:
http://arxiv.org/abs/2202.07028
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples -- there are too many questions one can ask about an image…
External link:
http://arxiv.org/abs/2109.06122
Author:
Kil, Jihyung, Chao, Wei-Lun
Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations…
External link:
http://arxiv.org/abs/2104.10355
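The abstract describes the classic zero-shot recipe of matching visual features against per-class semantic representations. A minimal sketch of that scoring step follows; the cosine-similarity scoring, shapes, and random features are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch of semantic-representation-based zero-shot classification:
# score an image embedding against one document-derived embedding per unseen
# class and pick the best match. Illustrative assumptions throughout.
import numpy as np

def zero_shot_predict(image_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """image_emb: (d,) visual feature; class_embs: (C, d) per-class document
    embeddings. Returns the index of the predicted unseen class."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return int(np.argmax(class_embs @ image_emb))  # cosine similarity

rng = np.random.default_rng(0)
img = rng.normal(size=64)        # hypothetical image feature
docs = rng.normal(size=(5, 64))  # hypothetical embeddings of 5 class documents
print(zero_shot_predict(img, docs))
```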