Search results

Report

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Authors: Zhang, Jieyu, Xue, Le, Song, Linxin, Wang, Jun, Huang, Weikai, Shu, Manli, Yan, An, Ma, Zixian, Niebles, Juan Carlos, Savarese, Silvio, Xiong, Caiming, Chen, Zeyuan, Krishna, Ranjay, Xu, Ran

With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) o…

External link: http://arxiv.org/abs/2412.07012

Report

Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

Authors: Du, Yuzhen, Hu, Teng, Zhang, Jiangning, Yi, Ran, Xu, Chengming, Hu, Xiaobin, Wu, Kai, Luo, Donghao, Wang, Yabiao, Ma, Lizhuang

Image restoration (IR) aims to recover high-quality images from degraded inputs, with recent deep learning advancements significantly enhancing performance. However, existing methods lack a unified training benchmark for iterations and configurations…

External link: http://arxiv.org/abs/2412.03814

Report

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Authors: Awadalla, Anas, Xue, Le, Shu, Manli, Yan, An, Wang, Jun, Purushwalkam, Senthil, Shen, Sheng, Lee, Hannah, Lo, Oscar, Park, Jae Sung, Guha, Etash, Savarese, Silvio, Schmidt, Ludwig, Choi, Yejin, Xiong, Caiming, Xu, Ran

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually…

External link: http://arxiv.org/abs/2411.07461

Report

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Authors: Xu, Ran, Liu, Hui, Nag, Sreyashi, Dai, Zhenwei, Xie, Yaochen, Tang, Xianfeng, Luo, Chen, Li, Yang, Ho, Joyce C., Yang, Carl, He, Qi

Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine po…

External link: http://arxiv.org/abs/2410.17952

Report

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Authors: Ryoo, Michael S., Zhou, Honglu, Kendre, Shrikant, Qin, Can, Xue, Le, Shu, Manli, Savarese, Silvio, Xu, Ran, Xiong, Caiming, Niebles, Juan Carlos

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventio…

External link: http://arxiv.org/abs/2410.16267

Report

Trust but Verify: Programmatic VLM Evaluation in the Wild

Authors: Prabhu, Viraj, Purushwalkam, Senthil, Yan, An, Xiong, Caiming, Xu, Ran

Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually veri…

External link: http://arxiv.org/abs/2410.13121

Academic article

Analysis on mechanical properties and evolution of mesostructure of soil–rock mixture samples from contact network perspective

Authors: Xu, Ran, Liu, Enlong, Xing, Huilin

Published in: Comptes Rendus. Mécanique, Vol 349, Iss 1, Pp 83-102 (2021)

Based on discrete element method (DEM), three kinds of soil–rock mixture (SRM) models with different coarse particle contents were established and triaxial compression tests were carried out. The results show that the force chains in the particle s…

External link: https://doaj.org/article/e8d88f3db9eb478b8bfa1415b2b3eddd

Report

xLAM: A Family of Large Action Models to Empower AI Agent Systems

Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality…

External link: http://arxiv.org/abs/2409.03215

Report

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Authors: Qin, Can, Xia, Congying, Ramakrishnan, Krithika, Ryoo, Michael, Tu, Lifu, Feng, Yihao, Shu, Manli, Zhou, Honglu, Awadalla, Anas, Wang, Jun, Purushwalkam, Senthil, Xue, Le, Zhou, Yingbo, Wang, Huan, Savarese, Silvio, Niebles, Juan Carlos, Chen, Zeyuan, Xu, Ran, Xiong, Caiming

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and i…

External link: http://arxiv.org/abs/2408.12590

Report

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, s…

External link: http://arxiv.org/abs/2408.08872
