Showing 1 - 10 of 1,649 for search: '"Wang, Xinlong"'
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has s…
External link:
http://arxiv.org/abs/2407.20171
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hing…
External link:
http://arxiv.org/abs/2407.08303
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual repres…
External link:
http://arxiv.org/abs/2406.11832
Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to…
External link:
http://arxiv.org/abs/2402.11265
Author:
Sun, Quan, Wang, Jinsheng, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Xinlong
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-bill…
External link:
http://arxiv.org/abs/2402.04252
Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an…
External link:
http://arxiv.org/abs/2401.09417
Author:
Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Luo, Zhengxiong, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-…
External link:
http://arxiv.org/abs/2312.13286
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizab…
External link:
http://arxiv.org/abs/2312.09128
Author:
Wang, Wenxuan, Yue, Tongtian, Zhang, Yisi, Guo, Longteng, He, Xingjian, Wang, Xinlong, Liu, Jing
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for the classic RES task heavily rely on the prior assumption that one e…
External link:
http://arxiv.org/abs/2312.08007
Text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models has shown great promise but still suffers from inconsistent 3D geometric structures (Janus problems) and severe artifacts. The aforementioned problems mainly st…
External link:
http://arxiv.org/abs/2311.17971