Towards Language-Guided Visual Recognition via Dynamic Convolutions.

Author: Luo, Gen; Zhou, Yiyi; Sun, Xiaoshuai; Wu, Yongjian; Gao, Yue; Ji, Rongrong
Source: International Journal of Computer Vision; Jan2024, Vol. 132 Issue 1, p1-19, 19p
Abstract: In this paper, we are committed to establishing a unified and end-to-end multi-modal network by exploring language-guided visual recognition. To approach this goal, we first propose a novel multi-modal convolution module called Language-guided Dynamic Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build a fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets of three vision-and-language tasks, i.e., visual question answering, referring expression comprehension and segmentation. The experimental results not only show the competitive or better performance of LaConvNet against existing multi-modal networks, but also demonstrate the merits of LaConvNet as a unified structure, including its compact network size, low computational cost and high generalization ability. Our source code is released in the SimREC project: https://github.com/luogen1996/LaConvNet. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
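The abstract describes convolution kernels that are generated per example from a language embedding. The snippet below is a minimal sketch of that idea, assuming a depthwise kernel form, a single linear projection, and toy dimensions, none of which are specified in the abstract; it is not the authors' LaConv implementation (see the linked repository for that).

```python
# Minimal sketch of a language-guided dynamic convolution.
# Hypothetical module and parameter names; the kernel-generation details
# (depthwise form, single linear projection) are assumptions, not LaConv itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedConv(nn.Module):
    """Generates per-example depthwise conv kernels from a language vector."""

    def __init__(self, channels: int, lang_dim: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Project the language embedding to one k x k filter per visual channel.
        self.kernel_gen = nn.Linear(lang_dim, channels * kernel_size * kernel_size)

    def forward(self, visual: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W), lang: (B, lang_dim)
        b, c, h, w = visual.shape
        k = self.kernel_size
        # One depthwise kernel set per example, conditioned on the language input.
        kernels = self.kernel_gen(lang).view(b * c, 1, k, k)
        # Fold the batch into the channel dimension so a grouped conv applies
        # each example's kernels only to that example's feature maps.
        out = F.conv2d(visual.reshape(1, b * c, h, w), kernels,
                       padding=k // 2, groups=b * c)
        return out.view(b, c, h, w)


if __name__ == "__main__":
    layer = LanguageGuidedConv(channels=16, lang_dim=32)
    img = torch.randn(2, 16, 24, 24)   # toy visual features
    txt = torch.randn(2, 32)           # toy sentence embedding
    print(layer(img, txt).shape)       # torch.Size([2, 16, 24, 24])
```

In this sketch the language embedding fully determines the kernels, so two different sentences convolve the same image with different filters, which is the property the abstract attributes to LaConv.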