Showing 1 - 10 of 1,747 results for search: '"An, Xizhou"'
Author:
Meng, Fanqing, Wang, Jin, Li, Chuanhao, Lu, Quanfeng, Tian, Hao, Liao, Jiaqi, Zhu, Xizhou, Dai, Jifeng, Qiao, Yu, Luo, Ping, Zhang, Kaipeng, Shao, Wenqi
The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not…
External link:
http://arxiv.org/abs/2408.02718
Author:
Liu, Yangzhou, Cao, Yue, Gao, Zhangwei, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Tian, Hao, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Qiao, Yu, Dai, Jifeng
Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets include the following limitations: (1) Instruction annotati…
External link:
http://arxiv.org/abs/2407.15838
The Klein bottle Benalcazar-Bernevig-Hughes (BBH) insulator phase plays a pivotal role in understanding higher-order topological phases. The insulator phase is characterized by a unique feature: a nonsymmorphic glide symmetry that exists within momen…
External link:
http://arxiv.org/abs/2407.07470
In this paper, we propose the task of Ranked Video Moment Retrieval (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and stud…
External link:
http://arxiv.org/abs/2407.06597
Author:
Li, Qingyun, Chen, Zhe, Wang, Weiyun, Wang, Wenhai, Ye, Shenglong, Jin, Zhenjiang, Chen, Guanzhou, He, Yinan, Gao, Zhangwei, Cui, Erfei, Yu, Jiashuo, Tian, Hao, Zhou, Jiasheng, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Li, Zhenxiang, Chu, Pei, Wang, Yi, Dou, Min, Tian, Changyao, Zhu, Xizhou, Lu, Lewei, Chen, Yushi, He, Junjun, Tu, Zhongying, Lu, Tong, Wang, Yali, Wang, Limin, Lin, Dahua, Qiao, Yu, Shi, Botian, He, Conghui, Dai, Jifeng
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data ai…
External link:
http://arxiv.org/abs/2406.08418
Author:
Wu, Jiannan, Zhong, Muyan, Xing, Sen, Lai, Zeqiang, Liu, Zhaoyang, Wang, Wenhai, Chen, Zhe, Zhu, Xizhou, Lu, Lewei, Lu, Tong, Luo, Ping, Qiao, Yu, Dai, Jifeng
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broad…
External link:
http://arxiv.org/abs/2406.08394
Author:
Yang, Chenyu, Zhu, Xizhou, Zhu, Jinguo, Su, Weijie, Wang, Junjie, Dong, Xuan, Wang, Wenhai, Lu, Lewei, Li, Bin, Zhou, Jie, Qiao, Yu, Dai, Jifeng
Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved imag…
External link:
http://arxiv.org/abs/2406.07543
Author:
Wang, Weiyun, Zhang, Shuibo, Ren, Yiming, Duan, Yuchen, Li, Tiantong, Liu, Shuo, Hu, Mengkang, Chen, Zhe, Zhang, Kaipeng, Lu, Lewei, Zhu, Xizhou, Luo, Ping, Qiao, Yu, Dai, Jifeng, Shao, Wenqi, Wang, Wenhai
With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplor…
External link:
http://arxiv.org/abs/2406.07230
Author:
Tao, Chenxin, Zhu, Xizhou, Su, Shiqian, Lu, Lewei, Tian, Changyao, Luo, Xuan, Huang, Gao, Li, Hongsheng, Qiao, Yu, Zhou, Jie, Dai, Jifeng
Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in…
External link:
http://arxiv.org/abs/2406.04342
Author:
Zhu, Xizhou, Yang, Xue, Wang, Zhaokai, Li, Hao, Dou, Wenhan, Ge, Junqi, Lu, Lewei, Qiao, Yu, Dai, Jifeng
Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires signi…
External link:
http://arxiv.org/abs/2406.04330