Showing 1 - 10 of 1,747 results for search: '"An, Xizhou"'
Author:
Meng, Fanqing, Wang, Jin, Li, Chuanhao, Lu, Quanfeng, Tian, Hao, Liao, Jiaqi, Zhu, Xizhou, Dai, Jifeng, Qiao, Yu, Luo, Ping, Zhang, Kaipeng, Shao, Wenqi
The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not…
External link:
http://arxiv.org/abs/2408.02718
Author:
Liu, Yangzhou, Cao, Yue, Gao, Zhangwei, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Tian, Hao, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Qiao, Yu, Dai, Jifeng
Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets include the following limitations: (1) Instruction annotati…
External link:
http://arxiv.org/abs/2407.15838
The Klein bottle Benalcazar-Bernevig-Hughes (BBH) insulator phase plays a pivotal role in understanding higher-order topological phases. The insulator phase is characterized by a unique feature: a nonsymmorphic glide symmetry that exists within momen…
External link:
http://arxiv.org/abs/2407.07470
In this paper, we propose the task of Ranked Video Moment Retrieval (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and stud…
External link:
http://arxiv.org/abs/2407.06597
Author:
Li, Qingyun, Chen, Zhe, Wang, Weiyun, Wang, Wenhai, Ye, Shenglong, Jin, Zhenjiang, Chen, Guanzhou, He, Yinan, Gao, Zhangwei, Cui, Erfei, Yu, Jiashuo, Tian, Hao, Zhou, Jiasheng, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Li, Zhenxiang, Chu, Pei, Wang, Yi, Dou, Min, Tian, Changyao, Zhu, Xizhou, Lu, Lewei, Chen, Yushi, He, Junjun, Tu, Zhongying, Lu, Tong, Wang, Yali, Wang, Limin, Lin, Dahua, Qiao, Yu, Shi, Botian, He, Conghui, Dai, Jifeng
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data ai…
External link:
http://arxiv.org/abs/2406.08418
Author:
Wu, Jiannan, Zhong, Muyan, Xing, Sen, Lai, Zeqiang, Liu, Zhaoyang, Wang, Wenhai, Chen, Zhe, Zhu, Xizhou, Lu, Lewei, Lu, Tong, Luo, Ping, Qiao, Yu, Dai, Jifeng
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broad…
External link:
http://arxiv.org/abs/2406.08394
Author:
Yang, Chenyu, Zhu, Xizhou, Zhu, Jinguo, Su, Weijie, Wang, Junjie, Dong, Xuan, Wang, Wenhai, Lu, Lewei, Li, Bin, Zhou, Jie, Qiao, Yu, Dai, Jifeng
Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved imag…
External link:
http://arxiv.org/abs/2406.07543
Author:
Wang, Weiyun, Zhang, Shuibo, Ren, Yiming, Duan, Yuchen, Li, Tiantong, Liu, Shuo, Hu, Mengkang, Chen, Zhe, Zhang, Kaipeng, Lu, Lewei, Zhu, Xizhou, Luo, Ping, Qiao, Yu, Dai, Jifeng, Shao, Wenqi, Wang, Wenhai
With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplor…
External link:
http://arxiv.org/abs/2406.07230
Author:
Tao, Chenxin, Zhu, Xizhou, Su, Shiqian, Lu, Lewei, Tian, Changyao, Luo, Xuan, Huang, Gao, Li, Hongsheng, Qiao, Yu, Zhou, Jie, Dai, Jifeng
Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in…
External link:
http://arxiv.org/abs/2406.04342
Author:
Zhu, Xizhou, Yang, Xue, Wang, Zhaokai, Li, Hao, Dou, Wenhan, Ge, Junqi, Lu, Lewei, Qiao, Yu, Dai, Jifeng
Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires signi…
External link:
http://arxiv.org/abs/2406.04330