Showing 1 - 10 of 769 for search: '"Li Juncheng"'
Author:
Zheng, Haoyu, Zhang, Wenqiao, Lv, Zheqi, Zhong, Yu, Dai, Yang, An, Jianxiang, Shen, Yongliang, Li, Juncheng, Zhang, Dongping, Tang, Siliang, Zhuang, Yueting
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task…
External link:
http://arxiv.org/abs/2412.19978
Author:
Liu, Jiang, Li, Bolin, Li, Haoyuan, Lin, Tianwei, Zhang, Wenqiao, Zhong, Tao, Yu, Zhelun, Wei, Jinghao, Cheng, Hao, Jiang, Hao, Lv, Zheqi, Li, Juncheng, Tang, Siliang, Zhuang, Yueting
Efficient multimodal large language models (EMLLMs), in contrast to multimodal large language models (MLLMs), reduce model size and computational costs and are often deployed on resource-constrained devices. However, due to data privacy concerns, existing…
External link:
http://arxiv.org/abs/2412.19684
Author:
Ge, Zhiqi, Li, Juncheng, Pang, Xinglei, Gao, Minghe, Pan, Kaihang, Lin, Wang, Fei, Hao, Zhang, Wenqiao, Tang, Siliang, Zhuang, Yueting
Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates…
External link:
http://arxiv.org/abs/2412.10342
Author:
Yu, Qifan, Shen, Zhebei, Yue, Zhongqi, Wu, Yang, Zhang, Wenqiao, Li, Yunfei, Li, Juncheng, Tang, Siliang, Zhuang, Yueting
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose…
External link:
http://arxiv.org/abs/2412.06293
Author:
Qu, Leigang, Li, Haochuan, Wang, Wenjie, Liu, Xiang, Li, Juncheng, Nie, Liqiang, Chua, Tat-Seng
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in…
External link:
http://arxiv.org/abs/2412.05818
Author:
Bai, Jinbin, Chow, Wei, Yang, Ling, Li, Xiangtai, Li, Juncheng, Zhang, Hanwang, Yan, Shuicheng
We present HumanEdit, a high-quality, human-rewarded dataset specifically designed for instruction-guided image editing, enabling precise and diverse image manipulations through open-form language instructions. Previous large-scale editing datasets…
External link:
http://arxiv.org/abs/2412.04280
Author:
Qiu, Haiyi, Gao, Minghe, Qian, Long, Pan, Kaihang, Yu, Qifan, Li, Juncheng, Wang, Wenjie, Tang, Siliang, Zhuang, Yueting, Chua, Tat-Seng
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal…
External link:
http://arxiv.org/abs/2412.00161
Author:
Yu, Qifan, Chow, Wei, Yue, Zhongqi, Pan, Kaihang, Wu, Yang, Wan, Xiaoyang, Li, Juncheng, Tang, Siliang, Zhang, Hanwang, Zhuang, Yueting
Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data…
External link:
http://arxiv.org/abs/2411.15738
Author:
Gao, Minghe, Bu, Wendong, Miao, Bingchen, Wu, Yang, Li, Yunfei, Li, Juncheng, Tang, Siliang, Wu, Qi, Zhuang, Yueting, Wang, Meng
In this paper, we introduce the Generalist Virtual Agent (GVA), an autonomous entity engineered to function across diverse digital platforms and environments, assisting users by executing a variety of tasks. This survey delves into the evolution of GVAs…
External link:
http://arxiv.org/abs/2411.10943
Author:
Chow, Wei, Li, Juncheng, Yu, Qifan, Pan, Kaihang, Fei, Hao, Ge, Zhiqi, Yang, Shuai, Tang, Siliang, Zhang, Hanwang, Sun, Qianru
In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object…
External link:
http://arxiv.org/abs/2411.00304