X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

Author: Cai, Zhaowei, Kwon, Gukyeong, Ravichandran, Avinash, Bas, Erhan, Tu, Zhuowen, Bhotika, Rahul, Soatto, Stefano
Publication year: 2022
Subject:
Document type: Working Paper
Description: In this paper, we study challenging instance-wise vision-language tasks, where free-form language is required to align with objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end, and they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, such that the detector is optimized for the vision-language tasks rather than used as an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other weak types of supervision to expand the knowledge coverage. This simple yet effective architecture of X-DETR shows good accuracy and fast speeds for multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second without using any LVIS annotation during training.
Database: arXiv
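
Illustrative note: the description states that the vision and language streams stay independent and are aligned by a dot-product operation. The following is a minimal sketch of that idea only, not the authors' implementation. The function names (`alignment_scores`, `best_match`) and the toy 2-dimensional embeddings are hypothetical, and the plain unnormalized dot product is an assumption; the record does not say whether the embeddings are normalized or scaled.

```python
def alignment_scores(object_embs, text_embs):
    """Return a matrix S where S[i][j] is the dot product between
    object embedding i and text embedding j.

    Assumption: both streams already produce vectors of equal length.
    """
    return [
        [sum(a * b for a, b in zip(obj, txt)) for txt in text_embs]
        for obj in object_embs
    ]


def best_match(scores):
    """For each object (row), return the index of the highest-scoring text query."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy data (hypothetical): three detected objects, two language queries.
object_embs = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
text_embs = [[3.0, 0.0], [0.0, 1.0]]

scores = alignment_scores(object_embs, text_embs)
matches = best_match(scores)
```

Because each stream is encoded separately, the object and text embeddings in a sketch like this can be computed once and reused, leaving only the dot products to evaluate for each object-query pair.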