Showing 1 - 10 of 93
for search: '"Wang, Xin Eric"'
Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reas…
External link:
http://arxiv.org/abs/2407.12366
Author:
Fan, Yue, Ding, Lei, Kuo, Ching-Chen, Jiang, Shan, Zhao, Yang, Guan, Xinze, Yang, Jie, Zhang, Yi, Wang, Xin Eric
Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring tas…
External link:
http://arxiv.org/abs/2406.19263
Author:
Gu, Jing, Fang, Yuwei, Skorokhodov, Ivan, Wonka, Peter, Du, Xinya, Tulyakov, Sergey, Wang, Xin Eric
Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to…
External link:
http://arxiv.org/abs/2406.12831
Author:
Zhou, Yufan, Zhang, Ruiyi, Zheng, Kaizhi, Zhao, Nanxuan, Gu, Jiuxiang, Wang, Zichao, Wang, Xin Eric, Sun, Tong
In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for…
External link:
http://arxiv.org/abs/2406.09305
Author:
He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, Wang, Xin Eric
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate ric…
External link:
http://arxiv.org/abs/2406.08407
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that…
External link:
http://arxiv.org/abs/2405.20421
Author:
He, Xuehai, Zheng, Jian, Fang, Jacob Zhiyuan, Piramuthu, Robinson, Bansal, Mohit, Ordonez, Vicente, Sigurdsson, Gunnar A, Peng, Nanyun, Wang, Xin Eric
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency a…
External link:
http://arxiv.org/abs/2405.04834
Author:
Gu, Jing, Wang, Yilin, Zhao, Nanxuan, Xiong, Wei, Liu, Qing, Zhang, Zhifei, Zhang, He, Zhang, Jianming, Jung, HyunJoon, Wang, Xin Eric
Effective editing of personal content plays a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore,…
External link:
http://arxiv.org/abs/2404.05717
Author:
Fan, Yue, Gu, Jing, Zhou, Kaiwen, Yan, Qianqi, Jiang, Shan, Kuo, Ching-Chen, Guan, Xinze, Wang, Xin Eric
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanc…
External link:
http://arxiv.org/abs/2401.15847
In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are good at dif…
External link:
http://arxiv.org/abs/2310.05872