Showing 1 - 10 of 173 results for search: '"Fu, Xingyu"'
Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-ag…
External link:
http://arxiv.org/abs/2406.11243
Author:
Wang, Fei, Fu, Xingyu, Huang, James Y., Li, Zekun, Liu, Qin, Liu, Xiaogeng, Ma, Mingyu Derek, Xu, Nan, Zhou, Wenxuan, Zhang, Kai, Yan, Tianyi Lorena, Mo, Wenjie Jacky, Liu, Hsiang-Hui, Lu, Pan, Li, Chunyuan, Xiao, Chaowei, Chang, Kai-Wei, Roth, Dan, Zhang, Sheng, Poon, Hoifung, Chen, Muhao
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of…
External link:
http://arxiv.org/abs/2406.09411
Author:
Hu, Yushi, Shi, Weijia, Fu, Xingyu, Roth, Dan, Ostendorf, Mari, Zettlemoyer, Luke, Smith, Noah A, Krishna, Ranjay
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are mi…
External link:
http://arxiv.org/abs/2406.09403
We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that align with commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an ident…
External link:
http://arxiv.org/abs/2406.07546
Author:
Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A., Ma, Wei-Chiu, Krishna, Ranjay
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimati…
External link:
http://arxiv.org/abs/2404.12390
Despite the recent advancement in large language models (LLMs) and their high performance across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of ha…
External link:
http://arxiv.org/abs/2311.09702
Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image genera…
External link:
http://arxiv.org/abs/2310.01596
Author:
Fu, Xingyu, Xi, Mingze
Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards, or voice input, either suffer from poor ergonomic desi…
External link:
http://arxiv.org/abs/2309.00174
Author:
Fu, Xingyu, Zhang, Sheng, Kwon, Gukyeong, Perera, Pramuditha, Zhu, Henghui, Zhang, Yuhao, Li, Alexander Hanbo, Wang, William Yang, Wang, Zhiguo, Castelli, Vittorio, Ng, Patrick, Roth, Dan, Xiang, Bing
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown…
External link:
http://arxiv.org/abs/2305.18842
Recent advances in multimodal large language models (LLMs) have shown extreme effectiveness in visual question answering (VQA). However, the design nature of these end-to-end models prevents them from being interpretable to humans, undermining trust…
External link:
http://arxiv.org/abs/2305.14882