Showing 1 - 10 of 2,213 for search: '"Fan, Yue"'
Author:
Gao, Zhi; Zhang, Bofei; Li, Pengxiang; Ma, Xiaojian; Yuan, Tao; Fan, Yue; Wu, Yuwei; Jia, Yunde; Zhu, Song-Chun; Li, Qing
The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tun
External link:
http://arxiv.org/abs/2412.15606
Author:
Wang, Haiyang; Fan, Yue; Naeem, Muhammad Ferjad; Xian, Yongqin; Lenssen, Jan Eric; Wang, Liwei; Tombari, Federico; Schiele, Bernt
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily f
External link:
http://arxiv.org/abs/2410.23168
Author:
Fan, Yue; Xian, Yongqin; Zhai, Xiaohua; Kolesnikov, Alexander; Naeem, Muhammad Ferjad; Schiele, Bernt; Tombari, Federico
Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring
External link:
http://arxiv.org/abs/2407.00503
Author:
Fan, Yue; Ding, Lei; Kuo, Ching-Chen; Jiang, Shan; Zhao, Yang; Guan, Xinze; Yang, Jie; Zhang, Yi; Wang, Xin Eric
Graphical User Interfaces (GUIs) are central to our interaction with digital devices and growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: scre
External link:
http://arxiv.org/abs/2406.19263
Author:
He, Xuehai; Feng, Weixi; Zheng, Kaizhi; Lu, Yujie; Zhu, Wanrong; Li, Jiachen; Fan, Yue; Wang, Jianfeng; Li, Linjie; Yang, Zhengyuan; Lin, Kevin; Wang, William Yang; Wang, Lijuan; Wang, Xin Eric
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate ric
External link:
http://arxiv.org/abs/2406.08407
Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Existing methods adopt neural rendering methods as 3D representations and joi
External link:
http://arxiv.org/abs/2403.15624
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relati
External link:
http://arxiv.org/abs/2403.11481
Large-scale pre-trained vision models (PVMs) have shown great potential for adaptability across various downstream vision tasks. However, with state-of-the-art PVMs growing to billions or even trillions of parameters, the standard full fine-tuning pa
External link:
http://arxiv.org/abs/2402.02242
Author:
Fan, Yue; Gu, Jing; Zhou, Kaiwen; Yan, Qianqi; Jiang, Shan; Kuo, Ching-Chen; Guan, Xinze; Wang, Xin Eric
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanc
External link:
http://arxiv.org/abs/2401.15847
Semi-supervised learning (SSL) methods effectively leverage unlabeled data to improve model generalization. However, SSL models often underperform in open-set scenarios, where unlabeled data contain outliers from novel categories that do not appear i
External link:
http://arxiv.org/abs/2311.10572