Showing 1 - 10 of 91 results for search: '"Guo, Jianyuan"'
Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires…
External link:
http://arxiv.org/abs/2410.17779
Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video…
External link:
http://arxiv.org/abs/2410.10441
Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches…
External link:
http://arxiv.org/abs/2408.06798
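To make the token-compression idea concrete, here is a minimal, hypothetical sketch of the generic technique the snippet describes (not this paper's specific method): score patch tokens by the attention they receive from the [CLS] token, keep the top-k, and merge the discarded tokens into a single weighted token. The function name and the attention-based scoring are illustrative assumptions.

```python
# Hypothetical sketch of generic ViT token compression (illustrative, not the paper's method):
# keep the k tokens most attended to by the [CLS] token and merge the rest into one token.
import torch

def compress_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) attention from the [CLS] token."""
    idx = cls_attn.topk(keep, dim=1).indices                           # (B, keep) most attentive tokens
    keep_mask = torch.zeros_like(cls_attn, dtype=torch.bool).scatter_(1, idx, True)
    kept = tokens[keep_mask].view(tokens.size(0), keep, -1)            # pruned token set
    # merge the discarded (inattentive) tokens into one, weighted by their attention
    drop_attn = cls_attn.masked_fill(keep_mask, 0.0)
    weights = drop_attn / drop_attn.sum(dim=1, keepdim=True).clamp_min(1e-6)
    merged = (weights.unsqueeze(-1) * tokens).sum(dim=1, keepdim=True) # (B, 1, D)
    return torch.cat([kept, merged], dim=1)                            # (B, keep + 1, D)
```

Merging rather than simply dropping preserves a summary of the inattentive tokens, which is the usual motivation for merge-style compression.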
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods, which replace less informative tokens with inter-modal features, a…
External link:
http://arxiv.org/abs/2406.01210
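For context on the token-exchange methods the abstract critiques, the following is a hypothetical sketch under simple assumptions: tokens of one modality judged least informative (scored here by L2 norm, purely as a stand-in) are overwritten with the aligned tokens of the other modality. Names, the exchange ratio, and the scoring rule are illustrative, not taken from the paper.

```python
# Hypothetical sketch of cross-modal token exchange (illustrative assumptions throughout):
# the least informative tokens of modality A are replaced by the aligned tokens of modality B.
import torch

def exchange_tokens(tok_a: torch.Tensor, tok_b: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """tok_a, tok_b: (B, N, D) aligned token sequences from two modalities."""
    num_swap = int(tok_a.size(1) * ratio)
    scores = tok_a.norm(dim=-1)                                    # (B, N) informativeness proxy
    idx = scores.topk(num_swap, dim=1, largest=False).indices      # least informative positions
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, tok_a.size(-1))     # (B, num_swap, D)
    out = tok_a.clone()
    out.scatter_(1, idx_exp, torch.gather(tok_b, 1, idx_exp))      # overwrite with modality-B tokens
    return out
```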
Author:
Wang, Chengcheng, Hao, Zhiwei, Tang, Yehui, Guo, Jianyuan, Yang, Yujie, Han, Kai, Wang, Yunhe
Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. But conventional diffusion models perform noise sampling from a single distribution, constraining their ability to…
External link:
http://arxiv.org/abs/2402.17133
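The "single distribution" the snippet refers to is the unit Gaussian used in the standard DDPM forward process; below is a minimal sketch of that conventional noising step (the baseline being critiqued, not the paper's proposed alternative). The function name and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of the standard DDPM forward (noising) step: every sample is
# perturbed with noise drawn from the same unit Gaussian, epsilon ~ N(0, I).
import torch

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """x0: (B, C, H, W) clean images; t: (B,) timesteps; alpha_bar: (T,) cumulative alphas."""
    eps = torch.randn_like(x0)                        # single distribution: N(0, I)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps       # q(x_t | x_0)
    return xt, eps
```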
Author:
Guo, Jianyuan, Hao, Zhiwei, Wang, Chengcheng, Tang, Yehui, Wu, Han, Hu, Han, Han, Kai, Xu, Chang
Published in:
ICML 2024
Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to out-of-domain tasks…
External link:
http://arxiv.org/abs/2402.04841
Recent advancements in large language models have sparked interest in their extraordinary and near-superhuman capabilities, leading researchers to explore methods for evaluating and optimizing these abilities, which is called superalignment. In this…
External link:
http://arxiv.org/abs/2402.03749
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLM) and large vision models (LVM). Model compression methods reduce the memory and computational…
External link:
http://arxiv.org/abs/2402.05964
Author:
Wang, Yunhe, Chen, Hanting, Tang, Yehui, Guo, Tianyu, Han, Kai, Nie, Ying, Wang, Xutao, Hu, Hailin, Bai, Zheyuan, Wang, Yun, Liu, Fangcheng, Liu, Zhicheng, Guo, Jianyuan, Zeng, Sinan, Zhang, Yinchen, Xu, Qinghua, Liu, Qun, Yao, Jun, Xu, Chao, Tao, Dacheng
The recent trend of large language models (LLMs) is to increase the scale of both model size (a.k.a. the number of parameters) and dataset to achieve better generative ability, as demonstrated by work such as the famous GPT and Llama…
External link:
http://arxiv.org/abs/2312.17276
Knowledge distillation (KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student…
External link:
http://arxiv.org/abs/2310.19444
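The teacher-student scheme mentioned in the snippet is typically implemented as a softened-logit matching term combined with the usual supervised loss; below is a minimal sketch of that classic formulation (Hinton-style KD), with the temperature and weighting chosen purely for illustration and not taken from this paper.

```python
# Minimal sketch of classic teacher-student knowledge distillation (Hinton-style KD):
# match the teacher's softened output distribution while keeping the supervised loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)    # softened-logit matching term
    hard = F.cross_entropy(student_logits, labels)      # standard supervised term
    return alpha * soft + (1.0 - alpha) * hard
```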