Author:
Yu, Ying; Wang, Jinghui; Pedrycz, Witold; Miao, Duoqian; Qian, Jin
Source:
Applied Intelligence, Sep 2024, Vol. 54, Issue 17/18, p8108-8119, 12p
Abstract:
Compared to traditional image recognition, Fine-Grained Image Recognition (FGIR) faces significant challenges due to the subtle distinctions among different categories and the notable variance within the same category. Furthermore, complex backgrounds and the fact that discriminative features are confined to small local regions further exacerbate the difficulty. Recently, several studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these investigations have frequently overlooked critical information embedded in class tokens across different layers, while also neglecting the subtle local details hidden within patch tokens. To address these issues and enhance FGIR performance, we introduce MIFBF, a novel ViT-based network architecture. The proposed model builds upon ViT by incorporating three modules: a Complementary Class Tokens Combination (CCTC) module, a Patches Information Integration (PII) module, and an Attention Cropping Module (ACM). The CCTC module integrates multi-layer class tokens to capture complementary information, thereby enhancing the model's representational capacity. The PII module mines the rich local details encoded in patch tokens to improve classification accuracy. The ACM module generates regions of interest from ViT's self-attention weights and filters out background noise, directing the model's attention to the most relevant image areas. Experiments conducted on three different datasets validate the effectiveness of the proposed model, yielding state-of-the-art results and highlighting its superiority in FGIR tasks. [ABSTRACT FROM AUTHOR]
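Since the abstract only sketches the three modules at a high level, the following is a minimal PyTorch sketch of how such components could be wired together. The names CCTC, PII, and attention_crop follow the abstract's terminology, but every dimension, the mean-pooling choice, and the thresholded bounding-box cropping heuristic are illustrative assumptions, not the authors' MIFBF implementation.

    # A minimal PyTorch sketch of the three modules as the abstract describes them.
    # All dimensions, pooling choices, and the cropping heuristic below are
    # illustrative assumptions, not the authors' MIFBF implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class CCTC(nn.Module):
        """Fuse class tokens collected from several ViT layers (complementary info)."""
        def __init__(self, dim, num_layers, num_classes):
            super().__init__()
            self.proj = nn.Linear(dim * num_layers, dim)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, class_tokens):
            # class_tokens: list of (B, dim) tensors, one per selected layer
            fused = self.proj(torch.cat(class_tokens, dim=-1))
            return self.head(fused)


    class PII(nn.Module):
        """Pool patch tokens so local details contribute to the prediction."""
        def __init__(self, dim, num_classes):
            super().__init__()
            self.head = nn.Linear(dim, num_classes)

        def forward(self, patch_tokens):
            # patch_tokens: (B, N, dim); mean pooling is an assumed aggregation
            return self.head(patch_tokens.mean(dim=1))


    def attention_crop(image, attn, threshold=0.5):
        """ACM-style cropping: keep the bounding box of patches whose
        class-to-patch attention exceeds a fraction of its maximum."""
        # image: (B, C, H, W); attn: (B, N) with N = n * n patches
        B, _, H, W = image.shape
        n = int(attn.shape[1] ** 0.5)
        mask = attn.reshape(B, n, n) >= threshold * attn.amax(dim=1).reshape(B, 1, 1)
        sh, sw = H // n, W // n  # pixels per patch along each axis
        crops = []
        for b in range(B):
            ys, xs = mask[b].nonzero(as_tuple=True)
            if ys.numel() == 0:
                crops.append(image[b])
                continue
            y0, y1 = ys.min().item() * sh, (ys.max().item() + 1) * sh
            x0, x1 = xs.min().item() * sw, (xs.max().item() + 1) * sw
            crop = image[b:b + 1, :, y0:y1, x0:x1]
            # Resize the attended region back to full resolution for a second pass
            crops.append(F.interpolate(crop, size=(H, W), mode="bilinear",
                                       align_corners=False)[0])
        return torch.stack(crops)


    if __name__ == "__main__":
        B, dim, n = 2, 768, 14
        logits = CCTC(dim, num_layers=3, num_classes=200)(
            [torch.randn(B, dim) for _ in range(3)])
        local = PII(dim, num_classes=200)(torch.randn(B, n * n, dim))
        cropped = attention_crop(torch.randn(B, 3, 224, 224), torch.rand(B, n * n))
        print(logits.shape, local.shape, cropped.shape)

In this sketch, the cropped image would be fed through the backbone a second time so the model attends to the foreground region, which is one common way an attention-cropping module is used; the paper itself should be consulted for how MIFBF actually combines the three branches.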
Database:
Complementary Index |