Showing 1 - 10 of 47 for search: '"Tian, Yunjie"'
Author:
Ma, Tianren, Xie, Lingxi, Tian, Yunjie, Yang, Boyu, Zhang, Yuan, Doermann, David, Ye, Qixiang
An essential topic for multimodal large language models (MLLMs) is aligning vision and language concepts at a finer level. In particular, we devote efforts to encoding visual referential information for tasks such as referring and grounding. Existing…
External link:
http://arxiv.org/abs/2406.11327
Author:
Qiu, Jihao, Zhang, Yuan, Tang, Xi, Xie, Lingxi, Ma, Tianren, Yan, Pengyu, Doermann, David, Ye, Qixiang, Tian, Yunjie
Videos carry rich visual information including object description, action, interaction, etc., but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we…
External link:
http://arxiv.org/abs/2406.00258
A fundamental problem in learning robust and expressive visual representations lies in efficiently estimating the spatial relationships of visual semantics throughout the entire image. In this study, we propose vHeat, a novel vision backbone model…
External link:
http://arxiv.org/abs/2405.16555
Author:
Tian, Yunjie, Ma, Tianren, Xie, Lingxi, Qiu, Jihao, Tang, Xi, Zhang, Yuan, Jiao, Jianbin, Tian, Qi, Ye, Qixiang
In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language…
External link:
http://arxiv.org/abs/2401.13307
Author:
Liu, Yue, Tian, Yunjie, Zhao, Yuzhong, Yu, Hongtian, Xie, Lingxi, Wang, Yaowei, Ye, Qixiang, Liu, Yunfan
Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity…
External link:
http://arxiv.org/abs/2401.10166
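The abstract above emphasizes a state-space-style backbone running in linear time. As a rough illustration of why such scans are linear in sequence length, here is a minimal per-channel linear recurrence over flattened patch tokens; it is a generic sketch of the idea, not VMamba's actual selective-scan implementation, and the names (`linear_scan`, `decay`) are hypothetical.

```python
# Minimal sketch (assumption, not VMamba's code): a per-channel linear
# recurrence over flattened patch tokens, computed in O(L) time.
import torch

def linear_scan(x: torch.Tensor, decay: torch.Tensor) -> torch.Tensor:
    """x: (B, L, D) flattened patch features; decay: (D,) with values in (0, 1).
    Computes h_t = decay * h_{t-1} + x_t for every position t."""
    B, L, D = x.shape
    h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
    outputs = []
    for t in range(L):
        h = decay * h + x[:, t]        # broadcast over the batch dimension
        outputs.append(h)
    return torch.stack(outputs, dim=1)  # (B, L, D)

# Example: 14x14 = 196 patch tokens with 96 channels, scanned in linear time.
tokens = torch.randn(2, 196, 96)
out = linear_scan(tokens, decay=torch.full((96,), 0.9))
```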
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial…
External link:
http://arxiv.org/abs/2308.10561
We propose the integrally pre-trained transformer pyramid network (iTPN), towards jointly optimizing the network backbone and the neck, so that the transfer gap between representation models and downstream tasks is minimal. iTPN is born with two elaborated…
External link:
http://arxiv.org/abs/2211.12735
Recently, masked image modeling (MIM) has offered a new methodology for self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network…
External link:
http://arxiv.org/abs/2205.14949
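The efficiency idea mentioned in this abstract, running the encoder only on the visible (unmasked) patch tokens, can be illustrated with a short PyTorch-style sketch. This is a generic, MAE-style illustration under assumed names (`random_masking`, `patch_embed`, `encoder`), not the paper's implementation.

```python
# Minimal sketch (not the paper's code): drop masked patch tokens before the
# encoder so compute scales with the number of *visible* patches only.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch embeddings. Returns visible tokens and kept indices."""
    B, N, D = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    # Random permutation per sample; keep the first `num_keep` positions.
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]                       # (B, num_keep)
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))   # (B, num_keep, D)
    return visible, ids_keep

# Usage with a generic transformer encoder (hypothetical `patch_embed`/`encoder`):
# patches = patch_embed(images)               # (B, N, D)
# visible, ids_keep = random_masking(patches, mask_ratio=0.75)
# latent = encoder(visible)                   # encoder never sees masked tokens
```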
Author:
Tian, Yunjie, Xie, Lingxi, Fang, Jiemin, Shi, Mengnan, Peng, Junran, Zhang, Xiaopeng, Jiao, Jianbin, Tian, Qi, Ye, Qixiang
The past year has witnessed a rapid development of masked image modeling (MIM). MIM is mostly built upon vision transformers, suggesting that self-supervised visual representations can be learned by masking input image parts while requiring the…
External link:
http://arxiv.org/abs/2203.14313
Existing neural architecture search algorithms mostly work on search spaces with short-distance connections. We argue that such designs, though safe and stable, prevent the search algorithms from exploring more complicated scenarios…
External link:
http://arxiv.org/abs/2112.02488