Showing 1 - 10 of 1,649 for search: '"Wang, Xinlong"'
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has s…
External link:
http://arxiv.org/abs/2407.20171
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hing…
External link:
http://arxiv.org/abs/2407.08303
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual repres…
External link:
http://arxiv.org/abs/2406.11832
Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to…
External link:
http://arxiv.org/abs/2402.11265
Author:
Sun, Quan, Wang, Jinsheng, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Xinlong
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-bill…
External link:
http://arxiv.org/abs/2402.04252
Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an…
External link:
http://arxiv.org/abs/2401.09417
Author:
Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Luo, Zhengxiong, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-…
External link:
http://arxiv.org/abs/2312.13286
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizab…
External link:
http://arxiv.org/abs/2312.09128
Author:
Wang, Wenxuan, Yue, Tongtian, Zhang, Yisi, Guo, Longteng, He, Xingjian, Wang, Xinlong, Liu, Jing
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for the classic RES task heavily rely on the prior assumption that one e…
External link:
http://arxiv.org/abs/2312.08007
Text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models has shown great promise but still suffers from inconsistent 3D geometric structures (Janus problems) and severe artifacts. The aforementioned problems mainly st…
External link:
http://arxiv.org/abs/2311.17971