Showing 1 - 10
of 272
for search: '"Liu Xihui"'
Author:
Wang, Yuqing, Xiong, Tianwei, Zhou, Daquan, Lin, Zhijie, Zhao, Yang, Kang, Bingyi, Feng, Jiashi, Liu, Xihui
It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language…
External link:
http://arxiv.org/abs/2410.02757
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing…
External link:
http://arxiv.org/abs/2410.01699
Author:
Wang, Yunnan, Li, Ziqiang, Zhang, Zequn, Zhang, Wenyao, Xie, Baao, Liu, Xihui, Zeng, Wenjun, Jin, Xin
There has been exciting progress in generating images from natural language or layout conditions. However, these methods struggle to faithfully reproduce complex scenes due to the insufficient modeling of multiple objects and their relationships. To…
External link:
http://arxiv.org/abs/2410.00447
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D awareness…
External link:
http://arxiv.org/abs/2409.18125
Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging…
External link:
http://arxiv.org/abs/2409.17145
Text-to-video (T2V) generation models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains underexplored. Previous text-to-video benchmarks also neglect this important ability…
External link:
http://arxiv.org/abs/2407.14505
Object-oriented embodied navigation aims to locate specific objects, defined by category or depicted in images. Existing methods often struggle to generalize to open-vocabulary goals without extensive training data. While recent advances in Vision-Language…
External link:
http://arxiv.org/abs/2407.09016
In this paper, we introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks. Despite significant progress in this field, there remains a lack of a standardized framework for a detailed and comparative…
External link:
http://arxiv.org/abs/2407.08418
Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems, including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images…
External link:
http://arxiv.org/abs/2407.05600
Author:
Qi, Zhangyang, Yang, Yunhan, Zhang, Mengchen, Xing, Long, Wu, Xiaoyang, Wu, Tong, Lin, Dahua, Liu, Xihui, Wang, Jiaqi, Zhao, Hengshuang
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed editing and customization of 3D assets remains a long-standing challenge…
External link:
http://arxiv.org/abs/2407.06191