Showing 1 - 10 of 214 for search: '"Zhang Kaipeng"'
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant …
External link:
http://arxiv.org/abs/2412.04062
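The ZipAR snippet above is cut off, but its core premise is that spatial locality limits dependencies between distant image tokens. As a loose illustration of how such locality can enable parallel decoding, the sketch below schedules raster-order token positions along a wavefront; the window size, the readiness rule, and the function itself are illustrative assumptions, not ZipAR's published algorithm.

def wavefront_schedule(height, width, window):
    """Group (row, col) grid positions into parallel decoding steps.

    Assumption (for illustration only): a token at (r, c) is ready once
    the row above has been decoded up to column c + window, so row r may
    trail row r - 1 by window + 1 positions. Every position within one
    step is then independent and can be decoded in a single forward pass.
    """
    offset = window + 1
    total_steps = width + (height - 1) * offset
    steps = []
    for t in range(total_steps):
        group = [(r, t - r * offset)
                 for r in range(height)
                 if 0 <= t - r * offset < width]
        if group:
            steps.append(group)
    return steps

# A 4x8 grid with window=2 finishes in 17 wavefront steps instead of the
# 32 steps of strictly sequential raster-order decoding.
print(len(wavefront_schedule(4, 8, 2)))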
Author:
Zhou, Pengfei, Peng, Xiaopeng, Song, Jiajun, Li, Chuanhao, Xu, Zhaopan, Yang, Yue, Guo, Ziyao, Zhang, Hao, Lin, Yuqi, He, Yefei, Zhao, Lirui, Liu, Shuo, Li, Tianhua, Xie, Yuxuan, Chang, Xiaojun, Qiao, Yu, Shao, Wenqi, Zhang, Kaipeng
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation …
External link:
http://arxiv.org/abs/2411.18499
Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks …
External link:
http://arxiv.org/abs/2410.18071
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks keep a …
External link:
http://arxiv.org/abs/2410.08695
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly …
External link:
http://arxiv.org/abs/2410.08584
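To make the decoding-phase KV-cache bottleneck described above concrete, here is a back-of-envelope size estimate; the model shape (32 layers, 32 KV heads, head dimension 128) and the 16k-token context are illustrative assumptions, not figures from the paper.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Keys and values (the factor of 2) are stored per layer, per head,
    # per token; every decoding step must fetch this cache from memory.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class LVLM with a 16,384-token visual + text context
# in fp16: the cache alone is ~8 GiB, streamed once per generated token.
print(kv_cache_bytes(32, 32, 128, 16_384) / 2**30, "GiB")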
Author:
Meng, Fanqing, Liao, Jiaqi, Tan, Xinyu, Shao, Wenqi, Lu, Quanfeng, Zhang, Kaipeng, Cheng, Yu, Li, Dianqi, Qiao, Yu, Luo, Ping
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation …
External link:
http://arxiv.org/abs/2410.05363
Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive fields …
External link:
http://arxiv.org/abs/2410.03174
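The linear complexity claimed for SSMs above follows from their recurrent form: each token costs one constant-size state update. The scalar recurrence below is a minimal sketch of that O(L) scan; the fixed coefficients are illustrative and omit Mamba's input-dependent (selective) parameterization.

def ssm_scan(xs, a_bar=0.9, b_bar=0.1, c=1.0):
    # Discretized state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    # One pass over the sequence, so the cost is linear in token length,
    # unlike the quadratic pairwise interactions of self-attention.
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

print(ssm_scan([1.0] * 4))  # [0.1, 0.19, 0.271, 0.3439]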
Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to i…
External link:
http://arxiv.org/abs/2408.12885
Author:
Li, Zekai, Guo, Ziyao, Zhao, Wangbo, Zhang, Tianle, Cheng, Zhi-Qi, Khaki, Samir, Zhang, Kaipeng, Sajedi, Ahmad, Plataniotis, Konstantinos N, Wang, Kai, You, Yang
Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the t…
External link:
http://arxiv.org/abs/2408.03360
Author:
Meng, Fanqing, Wang, Jin, Li, Chuanhao, Lu, Quanfeng, Tian, Hao, Liao, Jiaqi, Zhu, Xizhou, Dai, Jifeng, Qiao, Yu, Luo, Ping, Zhang, Kaipeng, Shao, Wenqi
The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not …
External link:
http://arxiv.org/abs/2408.02718