PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
Author: Nikolas Ebert, Didier Stricker, Oliver Wasenmüller
Language: English
Year of publication: 2023
Source: Sensors, Vol 23, Iss 7, p 3447 (2023)
Document type: article
ISSN: 1424-8220
DOI: 10.3390/s23073447
Description: Recently, transformer architectures have shown superior performance compared to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to capture visual dependencies over both short and long distances, thus generating a large, sometimes even global, receptive field. In this paper, we propose our Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be represented effectively and efficiently without the need for costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that our PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperformed similarly sized networks such as ConvNeXt and Swin Transformer, achieving Top-1 accuracy values of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
Database: Directory of Open Access Journals
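
The abstract describes an attention block that runs local window self-attention and global self-attention in parallel and fuses the two results. Below is a minimal PyTorch sketch of that general idea, not the paper's exact design: the module name PLGAttention, the even channel split between the two branches, and the use of average-pooled key/value tokens in the global branch are illustrative assumptions. Like the paper, the sketch needs no shifted windows, since the global branch already supplies long-range interactions.

```python
# Hypothetical sketch of parallel local/global self-attention (names and
# details are assumptions; the published PLG-ViT block may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PLGAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window_size=7, pool_size=7):
        super().__init__()
        assert dim % 2 == 0, "channels are split evenly between the branches"
        half = dim // 2
        self.window_size = window_size
        self.pool_size = pool_size
        # Local branch: multi-head self-attention inside each window.
        self.local_attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Global branch: every token attends to a pooled, low-resolution
        # copy of the feature map, giving a global receptive field.
        self.global_attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # fuse the concatenated branches

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W divisible by window_size
        B, H, W, C = x.shape
        xl, xg = x.split(C // 2, dim=-1)

        # ---- local window self-attention ----
        ws = self.window_size
        # partition into non-overlapping ws x ws windows
        win = xl.reshape(B, H // ws, ws, W // ws, ws, C // 2)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C // 2)
        local, _ = self.local_attn(win, win, win)
        # merge windows back into the (B, H, W, C/2) layout
        local = local.reshape(B, H // ws, W // ws, ws, ws, C // 2)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C // 2)

        # ---- global self-attention over pooled tokens ----
        q = xg.reshape(B, H * W, C // 2)
        kv = F.adaptive_avg_pool2d(
            xg.permute(0, 3, 1, 2), self.pool_size   # (B, C/2, p, p)
        ).flatten(2).transpose(1, 2)                  # (B, p*p, C/2)
        glob, _ = self.global_attn(q, kv, kv)
        glob = glob.reshape(B, H, W, C // 2)

        # ---- fuse both branches ----
        return self.proj(torch.cat([local, glob], dim=-1))

# usage: a 56x56 feature map with 64 channels
x = torch.randn(2, 56, 56, 64)
y = PLGAttention(dim=64)(x)
print(y.shape)  # torch.Size([2, 56, 56, 64])
```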