Showing 1 - 10 of 53 for the search: "Pan, Xuran"
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationship …
External link:
http://arxiv.org/abs/2312.10103
Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises …
External link:
http://arxiv.org/abs/2309.01430
The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear complexity by approximating …
External link:
http://arxiv.org/abs/2308.00442
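For context on the mechanism this abstract contrasts with softmax attention, here is a minimal sketch of kernelized linear attention. The feature map `phi` (ReLU plus a small constant) and all names and sizes are illustrative assumptions, not taken from the paper:

```python
import torch

def softmax_attention(Q, K, V):
    # Standard attention: materializes an n x n score matrix, O(n^2 * d).
    scores = Q @ K.transpose(-1, -2) / Q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ V

def linear_attention(Q, K, V, phi=lambda t: torch.relu(t) + 1e-6):
    # Replace softmax(Q K^T) with phi(Q) phi(K)^T, then reassociate:
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), which costs O(n * d^2).
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.transpose(-1, -2) @ V      # d x d summary of keys and values
    z = Qp @ Kp.sum(dim=0)             # per-query normalizer
    return (Qp @ kv) / z.unsqueeze(-1)

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)        # never forms the n x n matrix
```

Reassociating the matrix product is what removes the n x n score matrix and yields the linear complexity the abstract mentions.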
Author:
Han, Yizeng, Han, Dongchen, Liu, Zeyu, Wang, Yulin, Pan, Xuran, Pu, Yifan, Deng, Chao, Feng, Junlan, Song, Shiji, Huang, Gao
Early exiting has become a promising approach to improving the inference efficiency of deep networks. By structuring models with multiple classifiers (exits), predictions for "easy" samples can be generated at earlier exits, negating the need for …
External link:
http://arxiv.org/abs/2306.11248
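As a rough illustration of the multi-exit idea the abstract describes, below is a toy confidence-thresholded early-exit loop; the architecture, the layer sizes, and the 0.9 threshold are all hypothetical, not the paper's design:

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy multi-exit network: each stage has its own classifier (exit)."""
    def __init__(self, dim=64, num_classes=10, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_stages))
        self.exits = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_stages))

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for stage, exit_head in zip(self.stages, self.exits):
            x = stage(x)
            probs = exit_head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:   # "easy" sample: stop early
                return pred
        return pred                        # fall through to the last exit

model = EarlyExitNet()
y = model(torch.randn(1, 64))              # single-sample inference
```

Confidence thresholding is only one possible exit criterion; it is used here because it makes the "easy samples exit early" behavior explicit in a few lines.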
The self-attention mechanism has been a key factor in the recent progress of the Vision Transformer (ViT), as it enables adaptive feature extraction from global contexts. However, existing self-attention methods either adopt sparse global attention or window attention …
External link:
http://arxiv.org/abs/2304.04237
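For reference, a plain sketch of the window-attention baseline this abstract alludes to (non-overlapping windows, illustrative sizes; this is not the paper's proposed method):

```python
import torch

def window_attention(x, window=4):
    # Non-overlapping window attention: tokens attend only within their
    # local window, cutting the quadratic cost to O(n * window * d).
    n, d = x.shape
    xw = x.view(n // window, window, d)            # split into windows
    scores = xw @ xw.transpose(-1, -2) / d ** 0.5  # (num_win, w, w)
    out = scores.softmax(dim=-1) @ xw
    return out.reshape(n, d)

y = window_attention(torch.randn(16, 32))
```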
Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using the language modality have achieved great success on 2D vision tasks, the joint representation learning …
External link:
http://arxiv.org/abs/2301.07584
Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks. Nevertheless, existing approaches …
External link:
http://arxiv.org/abs/2210.08901
Recently, Neural Radiance Fields (NeRF) has shown promising performance on reconstructing 3D scenes and synthesizing novel views from a sparse set of 2D images. Albeit effective, the performance of NeRF is highly influenced by the quality of training …
External link:
http://arxiv.org/abs/2209.08546
Published in:
Pattern Recognition, Volume 155, November 2024
Transformers have recently shown superior performance on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply enlarging the receptive field …
External link:
http://arxiv.org/abs/2201.00520