Showing 1 - 2 of 2 for search: '"Shen, Haozhan"'
Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, necessitates the model to capture the interactions between objects, as well as various spatial and attribute information…
External link:
http://arxiv.org/abs/2312.15043
Authors:
Zhao, Tiancheng; Zhang, Tianqi; Zhu, Mingwei; Shen, Haozhan; Lee, Kyusong; Lu, Xiaopeng; Yin, Jianwei
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal downstream tasks. Most existing works evaluated their systems by comparing the fine-tuned downstream task performance. However, only average downstream t…
External link:
http://arxiv.org/abs/2207.00221