Showing 1 - 10 of 491
for search: '"Dong, Xiaoyi"'
Author:
Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Zhang, Wenwei, Li, Yining, Gao, Yang, Sun, Peng, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Yan, Hang, He, Conghui, Zhang, Xingcheng, Chen, Kai, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities…
External link:
http://arxiv.org/abs/2407.03320
Author:
Ma, Yubo, Zang, Yuhang, Chen, Liangyu, Chen, Meiqi, Jiao, Yizhu, Li, Xinze, Lu, Xinyuan, Liu, Ziyu, Ma, Yan, Dong, Xiaoyi, Zhang, Pan, Pan, Liangming, Jiang, Yu-Gang, Wang, Jiaqi, Cao, Yixin, Sun, Aixin
Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding…
External link:
http://arxiv.org/abs/2407.01523
Author:
Liu, Ziyu, Chu, Tao, Zang, Yuhang, Wei, Xilin, Dong, Xiaoyi, Zhang, Pan, Liang, Zijian, Xiong, Yuanjun, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios…
External link:
http://arxiv.org/abs/2406.11833
Author:
Wang, Jiaqi, Zang, Yuhang, Zhang, Pan, Chu, Tao, Cao, Yuhang, Sun, Zeyi, Liu, Ziyu, Dong, Xiaoyi, Wu, Tong, Lin, Dahua, Chen, Zeming, Wang, Zhi, Meng, Lingchen, Yao, Wenhao, Yang, Jianwei, Wu, Sihong, Chen, Zhineng, Wu, Zuxuan, Jiang, Yu-Gang, Wu, Peixi, Chai, Bosong, Nie, Xuan, Yan, Longquan, Wang, Zeyu, Zhou, Qifan, Wang, Boning, Huang, Jiaqi, Xu, Zunnan, Li, Xiu, Yuan, Kehong, Zu, Yanyan, Ha, Jiayao, Gao, Qiong, Jiao, Licheng
Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of…
External link:
http://arxiv.org/abs/2406.11739
Author:
Ling, Pengyang, Bu, Jiazi, Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Wu, Tong, Chen, Huaian, Wang, Jiaqi, Jin, Yi
Motion-based controllable text-to-video generation uses motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches…
External link:
http://arxiv.org/abs/2406.05338
Author:
Chen, Lin, Wei, Xilin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Lin, Bin, Tang, Zhenyu, Yuan, Li, Qiao, Yu, Lin, Dahua, Zhao, Feng, Wang, Jiaqi
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video…
External link:
http://arxiv.org/abs/2406.04325
Author:
Sun, Zeyi, Wu, Tong, Zhang, Pan, Zang, Yuhang, Dong, Xiaoyi, Xiong, Yuanjun, Lin, Dahua, Wang, Jiaqi
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is…
External link:
http://arxiv.org/abs/2406.00093
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens, streamingly encoded and adaptively selected…
External link:
http://arxiv.org/abs/2405.16009
Instruction-based image editing focuses on equipping a generative model with the capacity to adhere to human-written instructions for editing images. Current approaches typically comprehend explicit and specific instructions. However, they often…
External link:
http://arxiv.org/abs/2405.11190
Author:
Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, Ma, Ji, Wang, Jiaqi, Dong, Xiaoyi, Yan, Hang, Guo, Hewei, He, Conghui, Shi, Botian, Jin, Zhenjiang, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Dou, Min, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Lin, Dahua, Qiao, Yu, Dai, Jifeng, Wang, Wenhai
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements…
External link:
http://arxiv.org/abs/2404.16821