Showing 1 - 10 of 173 results for search: '"Fu, Xingyu"'
Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-ag…
External link:
http://arxiv.org/abs/2406.11243
Author:
Wang, Fei, Fu, Xingyu, Huang, James Y., Li, Zekun, Liu, Qin, Liu, Xiaogeng, Ma, Mingyu Derek, Xu, Nan, Zhou, Wenxuan, Zhang, Kai, Yan, Tianyi Lorena, Mo, Wenjie Jacky, Liu, Hsiang-Hui, Lu, Pan, Li, Chunyuan, Xiao, Chaowei, Chang, Kai-Wei, Roth, Dan, Zhang, Sheng, Poon, Hoifung, Chen, Muhao
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of…
External link:
http://arxiv.org/abs/2406.09411
Author:
Hu, Yushi, Shi, Weijia, Fu, Xingyu, Roth, Dan, Ostendorf, Mari, Zettlemoyer, Luke, Smith, Noah A, Krishna, Ranjay
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are mi…
External link:
http://arxiv.org/abs/2406.09403
We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that align with commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an ident…
External link:
http://arxiv.org/abs/2406.07546
Author:
Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A., Ma, Wei-Chiu, Krishna, Ranjay
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimati…
External link:
http://arxiv.org/abs/2404.12390
Despite the recent advancement in large language models (LLMs) and their high performance across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of ha…
External link:
http://arxiv.org/abs/2311.09702
Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image genera…
External link:
http://arxiv.org/abs/2310.01596
Author:
Fu, Xingyu, Xi, Mingze
Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards, or voice input, either suffer from poor ergonomic desi…
External link:
http://arxiv.org/abs/2309.00174
Author:
Fu, Xingyu, Zhang, Sheng, Kwon, Gukyeong, Perera, Pramuditha, Zhu, Henghui, Zhang, Yuhao, Li, Alexander Hanbo, Wang, William Yang, Wang, Zhiguo, Castelli, Vittorio, Ng, Patrick, Roth, Dan, Xiang, Bing
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown…
External link:
http://arxiv.org/abs/2305.18842
Recent advances in multimodal large language models (LLMs) have shown extreme effectiveness in visual question answering (VQA). However, the design nature of these end-to-end models prevents them from being interpretable to humans, undermining trust…
External link:
http://arxiv.org/abs/2305.14882