PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

Authors:	Nikolas Ebert, Didier Stricker, Oliver Wasenmüller
Language:	English
Year of publication:	2023
Subject:	transformer; self-attention; image classification; object detection; semantic segmentation; Chemical technology; TP1-1185
Source:	Sensors, Vol 23, Iss 7, p 3447 (2023)
Document type:	article
ISSN:	1424-8220
DOI:	10.3390/s23073447
Description:	Recently, transformer architectures have shown superior performance compared to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, thus generating a large, sometimes even a global receptive field. In this paper, we propose our Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be effectively and efficiently represented without the need for costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that our PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperformed similarly sized networks like ConvNeXt and Swin Transformer, achieving Top-1 accuracy values of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/080a64f825074ea089f9ad7f8687996c
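The description mentions fusing local window self-attention with global self-attention in parallel. The NumPy sketch below illustrates that general idea only; it is not the authors' implementation. Everything specific in it is an assumption made for illustration: the half-and-half channel split, the average-pooled summary used as keys and values in the global branch, the identity projections (no learned weights, single head), and the hypothetical names `parallel_local_global`, `window`, and `pooled`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; q: (..., Nq, d), k and v: (..., Nk, d).
    scale = q.shape[-1] ** -0.5
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale, axis=-1)
    return weights @ v

def window_partition(x, w):
    # (H, W, C) -> (num_windows, w*w, C); H and W must be divisible by w.
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_merge(windows, w, H, W):
    # Inverse of window_partition: (num_windows, w*w, C) -> (H, W, C).
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def parallel_local_global(x, window=4, pooled=2):
    # ASSUMPTION: channels are split in half, one half per branch.
    H, W, C = x.shape
    x_local, x_global = x[..., : C // 2], x[..., C // 2 :]

    # Local branch: self-attention restricted to non-overlapping windows.
    win = window_partition(x_local, window)
    local_out = window_merge(attention(win, win, win), window, H, W)

    # Global branch (ASSUMPTION): every position attends to an
    # average-pooled pooled x pooled summary of the whole feature map.
    Cg = x_global.shape[-1]
    ph, pw = H // pooled, W // pooled
    summary = x_global.reshape(pooled, ph, pooled, pw, Cg).mean(axis=(1, 3))
    summary = summary.reshape(-1, Cg)
    tokens = x_global.reshape(H * W, Cg)
    global_out = attention(tokens, summary, summary).reshape(H, W, Cg)

    # Fuse the two branches by concatenating along the channel axis.
    return np.concatenate([local_out, global_out], axis=-1)

x = np.random.default_rng(0).normal(size=(8, 8, 6))
out = parallel_local_global(x)
print(out.shape)
```

In this sketch the local branch costs grow with the window size rather than the full image size, while the global branch stays cheap because its keys and values are a small pooled summary. Both branches run on the same input without any window shifting, which is the property the description highlights.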