Výsledky vyhledávání

Report

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Autor: Meng, Fanqing, Wang, Jin, Li, Chuanhao, Lu, Quanfeng, Tian, Hao, Liao, Jiaqi, Zhu, Xizhou, Dai, Jifeng, Qiao, Yu, Luo, Ping, Zhang, Kaipeng, Shao, Wenqi

The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not

Externí odkaz: http://arxiv.org/abs/2408.02718

Zobrazit plný text záznamu

Report

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Autor: Liu, Yangzhou, Cao, Yue, Gao, Zhangwei, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Tian, Hao, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Qiao, Yu, Dai, Jifeng

Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets include the following limitations: (1) Instruction annotati

Externí odkaz: http://arxiv.org/abs/2407.15838

Zobrazit plný text záznamu

Report

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities

Externí odkaz: http://arxiv.org/abs/2407.03320

Zobrazit plný text záznamu

Report

Hierarchical Memory for Long Video QA

Autor: Wang, Yiqin, Zhang, Haoji, Tang, Yansong, Liu, Yong, Feng, Jiashi, Dai, Jifeng, Jin, Xiaojie

This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging ta

Externí odkaz: http://arxiv.org/abs/2407.00603

Zobrazit plný text záznamu

Report

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

Autor: Gao, Jiawei, Wang, Ziqin, Xiao, Zeqi, Wang, Jingbo, Wang, Tai, Cao, Jinkun, Hu, Xiaolin, Liu, Si, Dai, Jifeng, Pang, Jiangmiao

Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large

Externí odkaz: http://arxiv.org/abs/2406.14558

Zobrazit plný text záznamu

Report

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data ai

Externí odkaz: http://arxiv.org/abs/2406.08418

Zobrazit plný text záznamu

Report

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Autor: Wu, Jiannan, Zhong, Muyan, Xing, Sen, Lai, Zeqiang, Liu, Zhaoyang, Wang, Wenhai, Chen, Zhe, Zhu, Xizhou, Lu, Lewei, Lu, Tong, Luo, Ping, Qiao, Yu, Dai, Jifeng

We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broad

Externí odkaz: http://arxiv.org/abs/2406.08394

Zobrazit plný text záznamu

Report

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Autor: Zhang, Haoji, Wang, Yiqin, Tang, Yansong, Liu, Yong, Feng, Jiashi, Dai, Jifeng, Jin, Xiaojie

Benefiting from the advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenario. However, online video streams, as one of the most common me

Externí odkaz: http://arxiv.org/abs/2406.08085

Zobrazit plný text záznamu

Report

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Autor: Yang, Chenyu, Zhu, Xizhou, Zhu, Jinguo, Su, Weijie, Wang, Junjie, Dong, Xuan, Wang, Wenhai, Lu, Lewei, Li, Bin, Zhou, Jie, Qiao, Yu, Dai, Jifeng

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved imag

Externí odkaz: http://arxiv.org/abs/2406.07543

Zobrazit plný text záznamu

Report

Needle In A Multimodal Haystack

Autor: Wang, Weiyun, Zhang, Shuibo, Ren, Yiming, Duan, Yuchen, Li, Tiantong, Liu, Shuo, Hu, Mengkang, Chen, Zhe, Zhang, Kaipeng, Lu, Lewei, Zhu, Xizhou, Luo, Ping, Qiao, Yu, Dai, Jifeng, Shao, Wenqi, Wang, Wenhai

With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplor

Externí odkaz: http://arxiv.org/abs/2406.07230

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání