Showing 1 - 10 of 93
for search: '"Wang, Xin Eric"'
Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reas…
External link:
http://arxiv.org/abs/2407.12366
Author:
Fan, Yue, Ding, Lei, Kuo, Ching-Chen, Jiang, Shan, Zhao, Yang, Guan, Xinze, Yang, Jie, Zhang, Yi, Wang, Xin Eric
Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring tas…
External link:
http://arxiv.org/abs/2406.19263
Author:
Gu, Jing, Fang, Yuwei, Skorokhodov, Ivan, Wonka, Peter, Du, Xinya, Tulyakov, Sergey, Wang, Xin Eric
Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to…
External link:
http://arxiv.org/abs/2406.12831
Author:
Zhou, Yufan, Zhang, Ruiyi, Zheng, Kaizhi, Zhao, Nanxuan, Gu, Jiuxiang, Wang, Zichao, Wang, Xin Eric, Sun, Tong
In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for…
External link:
http://arxiv.org/abs/2406.09305
Author:
He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, Wang, Xin Eric
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate ric…
External link:
http://arxiv.org/abs/2406.08407
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that…
External link:
http://arxiv.org/abs/2405.20421
Author:
He, Xuehai, Zheng, Jian, Fang, Jacob Zhiyuan, Piramuthu, Robinson, Bansal, Mohit, Ordonez, Vicente, Sigurdsson, Gunnar A, Peng, Nanyun, Wang, Xin Eric
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency a…
External link:
http://arxiv.org/abs/2405.04834
Author:
Gu, Jing, Wang, Yilin, Zhao, Nanxuan, Xiong, Wei, Liu, Qing, Zhang, Zhifei, Zhang, He, Zhang, Jianming, Jung, HyunJoon, Wang, Xin Eric
Effective editing of personal content plays a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore,…
External link:
http://arxiv.org/abs/2404.05717
Author:
Fan, Yue, Gu, Jing, Zhou, Kaiwen, Yan, Qianqi, Jiang, Shan, Kuo, Ching-Chen, Guan, Xinze, Wang, Xin Eric
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanc…
External link:
http://arxiv.org/abs/2401.15847
In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are good at dif…
External link:
http://arxiv.org/abs/2310.05872