Boosting cross-modal retrieval in remote sensing via a novel unified attention network.

Author: Choudhury S; Indian Institute of Technology, Bombay, India. Electronic address: shabnamchoudhury@iitb.ac.in., Saini D; Indian Institute of Technology, Bombay, India., Banerjee B; Indian Institute of Technology, Bombay, India; Centre for Machine Intelligence and Data Science (C-MInDS), Bombay, India.
Language: English
Source: Neural Networks: the official journal of the International Neural Network Society [Neural Netw] 2024 Dec; Vol. 180, pp. 106718. Date of Electronic Publication: 2024 Sep 11.
DOI: 10.1016/j.neunet.2024.106718
Abstract: With the rapid advent and abundance of remote sensing data in different modalities, cross-modal retrieval tasks have gained importance in the research community. Cross-modal retrieval refers to the research paradigm in which the query belongs to one modality and the retrieved output to another. In this paper, the remote sensing (RS) data modalities considered are earth observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge of cross-modal retrieval for optical remote sensing images and the corresponding sketches is the distribution gap between the modalities in the shared embedding space. Prior attempts to resolve this issue have not yielded satisfactory outcomes in accurately retrieving cross-modal sketch-image RS data. State-of-the-art approaches used conventional convolutional architectures, which focus on local pixel-wise information about the modalities to be retrieved. This limits the interaction between the sketch texture and the corresponding image, making these models susceptible to overfitting to datasets with particular scenarios. To circumvent this limitation, we propose establishing multi-modal correspondence with SPCA-Net, a novel architecture that combines self- and cross-attention to minimize the modality gap by employing attention mechanisms over the query and the other modality. Efficient cross-modal retrieval is achieved through the proposed attention architecture, which empirically emphasizes the global information of the relevant query modality and bridges the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a unique loss function, a label-specific supervised contrastive loss, tailored to the intricacies of the task and designed to enhance the discriminative power of the learned embeddings. Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, the performance metrics of our proposed model beat the state-of-the-art architectures by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%, respectively.
(Copyright © 2024 Elsevier Ltd. All rights reserved.)
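The abstract names two technical ingredients without giving their equations: a pairwise cross-attention network between the sketch and image branches, and a label-specific supervised contrastive loss. The sketch below is a minimal, generic illustration of both ideas in PyTorch; the class and function names, dimensions, and hyperparameters are assumptions for illustration and do not reproduce the authors' SPCA-Net implementation.

```python
# Illustrative sketch only: a generic pairwise cross-attention block and a
# label-supervised contrastive loss (Khosla et al. style). Names, shapes, and
# hyperparameters are hypothetical, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseCrossAttention(nn.Module):
    """Tokens of the query modality attend over keys/values of the other modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # query_tokens: (B, Nq, D), e.g. sketch features
        # context_tokens: (B, Nk, D), e.g. image features
        attended, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + attended)  # residual connection + layer norm


def supervised_contrastive_loss(embeddings, labels, temperature: float = 0.07):
    """Pull together embeddings that share a class label, push apart all others."""
    z = F.normalize(embeddings, dim=1)                        # (N, D) unit-norm embeddings
    sim = z @ z.t() / temperature                             # (N, N) scaled similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return (-(log_prob * pos_mask).sum(dim=1) / pos_count).mean()


if __name__ == "__main__":
    B, N, D = 8, 16, 128
    cross_attn = PairwiseCrossAttention(D)
    sketch_tokens = torch.randn(B, N, D)
    image_tokens = torch.randn(B, N, D)
    fused = cross_attn(sketch_tokens, image_tokens)           # (B, N, D)
    emb = torch.cat([fused.mean(dim=1), image_tokens.mean(dim=1)])  # pooled embeddings
    labels = torch.randint(0, 4, (2 * B,))                    # shared class labels
    print(supervised_contrastive_loss(emb, labels).item())
```

In this reading, the cross-attention block lets each sketch token aggregate global context from the image branch (and vice versa when the roles are swapped), while the supervised contrastive loss uses class labels so that sketch and image embeddings of the same scene category cluster in the shared space.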
Database: MEDLINE