Showing 1 - 10 of 491
for search: '"Dong, Xiaoyi"'
Author:
Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Zhang, Wenwei, Li, Yining, Gao, Yang, Sun, Peng, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Yan, Hang, He, Conghui, Zhang, Xingcheng, Chen, Kai, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities…
External link:
http://arxiv.org/abs/2407.03320
Author:
Ma, Yubo, Zang, Yuhang, Chen, Liangyu, Chen, Meiqi, Jiao, Yizhu, Li, Xinze, Lu, Xinyuan, Liu, Ziyu, Ma, Yan, Dong, Xiaoyi, Zhang, Pan, Pan, Liangming, Jiang, Yu-Gang, Wang, Jiaqi, Cao, Yixin, Sun, Aixin
Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding…
External link:
http://arxiv.org/abs/2407.01523
Author:
Liu, Ziyu, Chu, Tao, Zang, Yuhang, Wei, Xilin, Dong, Xiaoyi, Zhang, Pan, Liang, Zijian, Xiong, Yuanjun, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios…
External link:
http://arxiv.org/abs/2406.11833
Author:
Wang, Jiaqi, Zang, Yuhang, Zhang, Pan, Chu, Tao, Cao, Yuhang, Sun, Zeyi, Liu, Ziyu, Dong, Xiaoyi, Wu, Tong, Lin, Dahua, Chen, Zeming, Wang, Zhi, Meng, Lingchen, Yao, Wenhao, Yang, Jianwei, Wu, Sihong, Chen, Zhineng, Wu, Zuxuan, Jiang, Yu-Gang, Wu, Peixi, Chai, Bosong, Nie, Xuan, Yan, Longquan, Wang, Zeyu, Zhou, Qifan, Wang, Boning, Huang, Jiaqi, Xu, Zunnan, Li, Xiu, Yuan, Kehong, Zu, Yanyan, Ha, Jiayao, Gao, Qiong, Jiao, Licheng
Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of…
External link:
http://arxiv.org/abs/2406.11739
Author:
Ling, Pengyang, Bu, Jiazi, Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Wu, Tong, Chen, Huaian, Wang, Jiaqi, Jin, Yi
Motion-based controllable text-to-video generation uses motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches…
External link:
http://arxiv.org/abs/2406.05338
Author:
Chen, Lin, Wei, Xilin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Lin, Bin, Tang, Zhenyu, Yuan, Li, Qiao, Yu, Lin, Dahua, Zhao, Feng, Wang, Jiaqi
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video…
External link:
http://arxiv.org/abs/2406.04325
Author:
Sun, Zeyi, Wu, Tong, Zhang, Pan, Zang, Yuhang, Dong, Xiaoyi, Xiong, Yuanjun, Lin, Dahua, Wang, Jiaqi
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is…
External link:
http://arxiv.org/abs/2406.00093
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens, streamingly encoded and adaptively selected…
External link:
http://arxiv.org/abs/2405.16009
Instruction-based image editing focuses on equipping a generative model with the capacity to adhere to human-written instructions for editing images. Current approaches typically comprehend explicit and specific instructions. However, they often…
External link:
http://arxiv.org/abs/2405.11190
Author:
Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, Ma, Ji, Wang, Jiaqi, Dong, Xiaoyi, Yan, Hang, Guo, Hewei, He, Conghui, Shi, Botian, Jin, Zhenjiang, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Dou, Min, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Lin, Dahua, Qiao, Yu, Dai, Jifeng, Wang, Wenhai
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements…
External link:
http://arxiv.org/abs/2404.16821