Showing 1 - 10 of 539 for search: '"Feng, Zhenhua"'
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder archi
External link:
http://arxiv.org/abs/2409.17531
In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. N
External link:
http://arxiv.org/abs/2409.14882
Author:
Li, Rongchang, Feng, Zhenhua, Xu, Tianyang, Li, Linze, Wu, Xiao-Jun, Awais, Muhammad, Atito, Sara, Kittler, Josef
Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of
External link:
http://arxiv.org/abs/2407.06113
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of thes
External link:
http://arxiv.org/abs/2406.17460
Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architec
External link:
http://arxiv.org/abs/2406.17450
Author:
Tang, Zhangyong, Xu, Tianyang, Feng, Zhenhua, Zhu, Xuefeng, Wang, He, Shao, Pengcheng, Cheng, Chunyang, Wu, Xiao-Jun, Awais, Muhammad, Atito, Sara, Kittler, Josef
RGBT tracking draws increasing attention due to its robustness in multi-modality warranting (MMW) scenarios, such as nighttime and bad weather, where relying on a single sensing modality fails to ensure stable tracking results. However, the existing
External link:
http://arxiv.org/abs/2405.00168
Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting
External link:
http://arxiv.org/abs/2404.00509
Recently, self-supervised metric learning has raised attention for the potential to learn a generic distance function. It overcomes the limitations of conventional supervised one, e.g., scalability and label biases. Despite progress in this domain, c
External link:
http://arxiv.org/abs/2312.01118
Author:
Wu, Cong, Wu, Xiao-Jun, Kittler, Josef, Xu, Tianyang, Atito, Sara, Awais, Muhammad, Feng, Zhenhua
Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of represent
External link:
http://arxiv.org/abs/2309.05834
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have
External link:
http://arxiv.org/abs/2308.11448