PERS: Parameter-Efficient Multimodal Transfer Learning for Remote Sensing Visual Question Answering

Author: Jinlong He, Gang Liu, Pengfei Li, Xiaonan Su, Wenhua Jiang, Dongze Zhang, Shenjun Zhong
Language: English
Publication Year: 2024
Source: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 17, pp. 14823-14835 (2024)
Document Type: article
ISSN: 1939-1404, 2151-1535
DOI: 10.1109/JSTARS.2024.3447086
Description: Remote sensing (RS) visual question answering (VQA) produces accurate answers by jointly analyzing RS images (RSIs) and their associated questions. Recent research has increasingly adopted transformers for feature extraction, but this trend escalates training costs as model sizes grow. Furthermore, existing studies predominantly employ transformers to extract features from a single modality, integrating multimodal information insufficiently and thereby forfeiting the potential advantages of transformers for feature extraction and fusion in these scenarios. To address these challenges, we propose PERS, a parameter-efficient multimodal transfer learning method for RS VQA. We introduce a lightweight, parameter-efficient adapter into the visual feature extraction module, initialized with weights pretrained on large-scale RSIs, to reduce both training costs and the number of trainable parameters. A cross-attention mechanism handles multimodal interaction, strengthening the integration of information across modalities. Comprehensive experiments on three datasets (RSVQA-LR, RSVQA-HR, and RSVQAxBEN) show that our method achieves state-of-the-art performance. Moreover, exhaustive ablation studies demonstrate that our parameter-efficient adapter strategy matches full-parameter training while updating only a fraction of the parameters, validating the efficacy of our approach.
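
The abstract names two mechanisms: a lightweight adapter inserted into an RS-pretrained visual feature extractor, and cross-attention for multimodal interaction. Below is a minimal, illustrative PyTorch sketch of what such components commonly look like; the paper's exact architecture is not given in this record, so every module name, dimension, and design choice here (bottleneck adapter with residual connection, zero-initialized up-projection, question tokens attending over visual tokens) is an assumption, not the authors' implementation.

# Illustrative sketch only; hyperparameters and module designs are assumptions.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, nonlinearity, up-project, residual add.
    Only these few parameters are trained; the pretrained backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity map
        # and does not perturb the pretrained features at the start of training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class CrossAttentionFusion(nn.Module):
    """Question tokens (queries) attend over visual tokens (keys/values),
    fusing the two modalities into a single representation."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(text_tokens + fused)

# Usage: freeze the pretrained visual backbone, train only adapters + fusion.
dim = 768
adapter = BottleneckAdapter(dim)
fusion = CrossAttentionFusion(dim)
visual = torch.randn(2, 196, dim)    # e.g., ViT patch tokens from an RS image
text = torch.randn(2, 20, dim)       # question token embeddings
out = fusion(text, adapter(visual))  # (2, 20, 768) fused representation

The residual, zero-initialized adapter is a standard parameter-efficient transfer learning pattern: it leaves the frozen backbone's behavior intact at initialization while adding only a small number of trainable weights, which is consistent with the abstract's claim of near full-training performance from partial-parameter updates.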
Database: Directory of Open Access Journals