LLaVA-OneVision: Easy Visual Task Transfer

Author:	Li, Bo; Zhang, Yuanhan; Guo, Dong; Zhang, Renrui; Li, Feng; Zhang, Hao; Zhang, Kaichen; Zhang, Peiyuan; Li, Yanwei; Liu, Ziwei; Li, Chunyuan
Year of publication:	2024
Subject:	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Document type:	Working Paper
Description:	We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Comment:	Project Homepage: https://llava-vl.github.io/blog/2024-08-05-llava-onevision/
Database:	arXiv
External link:	http://arxiv.org/abs/2408.03326