Abstract: |
Image captioning aims to generate captions that accurately describe objects, their attributes, and the relationships or interactions within the scene depicted in an image. Traditional attention-based models often struggle to capture higher-order interactions and fail to account for the geometric and positional relationships among visual objects. To address these limitations, we propose a geometric-information-driven network, IGINet, that introduces a novel attention mechanism, GeoAtt, to enhance image captioning from two key perspectives. First, GeoAtt employs low-rank bilinear pooling to selectively harness visual information and enable multimodal reasoning, capturing inter-modal interactions through spatial and channel-wise attention distributions. Second, to improve geometric representation, we propose an approach for incorporating normalized geometric features directly into the attention mechanism. This integration yields attention maps that focus on the most relevant image regions during captioning, producing precise and context-rich descriptions. The GeoAtt module integrates smoothly into LSTM encoder-decoder frameworks, resulting in notable improvements in performance and efficiency. Extensive experiments on the MSCOCO benchmark dataset demonstrate that our approach substantially improves captioning performance, achieving competitive results compared to contemporary methods. Notably, the BLEU-4 score of 39.9 represents a state-of-the-art result among CNN-LSTM-based single-model approaches. The extracted features are freely available in the Mendeley Data repository, and the code for our implementation is publicly available at .
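Below is a minimal PyTorch sketch of how a GeoAtt-style attention step could combine low-rank bilinear pooling with a geometric bias, as the abstract describes. All module names, dimensions, and the exact fusion recipe (elementwise product in a shared low-rank space, a scalar geometric bias on the attention logits, and a channel-wise gate) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a GeoAtt-style attention step (PyTorch).
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class GeoAttSketch(nn.Module):
    def __init__(self, d_model=1024, d_low=256, d_geo=4):
        super().__init__()
        # Low-rank bilinear pooling: project the query (LSTM state) and the
        # region features into a shared low-rank space and fuse them by
        # elementwise product before scoring.
        self.q_proj = nn.Linear(d_model, d_low)
        self.k_proj = nn.Linear(d_model, d_low)
        self.score = nn.Linear(d_low, 1)           # spatial attention logits
        self.channel = nn.Linear(d_low, d_model)   # channel-wise gate
        # Embed normalized box geometry (x1/W, y1/H, x2/W, y2/H) into a
        # scalar bias added to each region's attention logit.
        self.geo_bias = nn.Sequential(
            nn.Linear(d_geo, d_low), nn.ReLU(), nn.Linear(d_low, 1))

    def forward(self, query, regions, geo):
        # query:   (B, d_model)      decoder LSTM hidden state
        # regions: (B, N, d_model)   visual region features
        # geo:     (B, N, 4)         normalized bounding-box coordinates
        q = torch.tanh(self.q_proj(query)).unsqueeze(1)   # (B, 1, d_low)
        k = torch.tanh(self.k_proj(regions))              # (B, N, d_low)
        joint = q * k                                     # bilinear fusion
        logits = self.score(joint).squeeze(-1) \
               + self.geo_bias(geo).squeeze(-1)           # geometry-aware
        alpha = torch.softmax(logits, dim=-1)             # spatial attention
        attended = (alpha.unsqueeze(-1) * regions).sum(1) # (B, d_model)
        gate = torch.sigmoid(self.channel(joint.mean(1))) # channel attention
        return gate * attended

# Usage with random tensors (2 images, 36 regions each):
m = GeoAttSketch()
out = m(torch.randn(2, 1024), torch.randn(2, 36, 1024), torch.rand(2, 36, 4))
print(out.shape)  # torch.Size([2, 1024])
```

Under these assumptions, the low-rank projections keep the bilinear interaction cheap, the geometric bias lets box positions reshape the spatial attention map, and the sigmoid gate supplies the channel-wise distribution mentioned in the abstract.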