PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

Authors: Nikolas Ebert, Didier Stricker, Oliver Wasenmüller
Language: English
Year of publication: 2023
Subject:
Source: Sensors, Vol 23, Iss 7, p 3447 (2023)
Document type: article
ISSN: 1424-8220
DOI: 10.3390/s23073447
Description: Recently, transformer architectures have shown superior performance compared to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, thus generating a large, sometimes even global, receptive field. In this paper, we propose the Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be represented effectively and efficiently without costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperform similarly sized networks such as ConvNeXt and Swin Transformer, achieving Top-1 accuracy values of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
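The idea of running local window self-attention and global self-attention side by side can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D token sequence, identity query/key/value projections, a single head, and an even channel split between the two branches, all of which are simplifications chosen for illustration. The function names (`self_attention`, `local_window_attention`, `parallel_local_global`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections
    (an assumption made to keep the sketch short)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def local_window_attention(tokens, window):
    """Attend only within non-overlapping windows of `window` tokens."""
    out = []
    for start in range(0, len(tokens), window):
        out.extend(self_attention(tokens[start:start + window]))
    return out


def parallel_local_global(tokens, window):
    """Hypothetical parallel block: the first half of the channels goes through
    local window attention, the second half through global attention over all
    tokens, and the two results are concatenated per token."""
    half = len(tokens[0]) // 2
    local_out = local_window_attention([t[:half] for t in tokens], window)
    global_out = self_attention([t[half:] for t in tokens])
    return [loc + glob for loc, glob in zip(local_out, global_out)]


# Eight toy tokens with four channels each, split into two windows of four.
tokens = [[float(i), float(i % 3), float(2 * i), 1.0] for i in range(8)]
fused = parallel_local_global(tokens, window=4)
```

In this sketch, changing a token in the second window leaves the local channels of tokens in the first window untouched but does alter their global channels, which is the short-range versus long-range split the description refers to. No window shifting is involved at any point.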
Database: Directory of Open Access Journals