Author:
Li, Jiachen; Xie, Qing; Chang, Xiaojun; Xu, Jinyu; Liu, Yongjian
Source:
ACM Transactions on Multimedia Computing, Communications & Applications; Dec 2024, Vol. 20, Issue 12, p1-18, 18p
Abstract:
Referring image segmentation aims to locate and segment the target region based on a given textual expression query. The primary challenge is to understand semantics from the visual and textual modalities and to achieve alignment and matching between them. Prior works have attempted to address this challenge by leveraging separately pretrained unimodal models to extract global visual and textual features and performing straightforward fusion to establish cross-modal semantic associations. However, these methods often concentrate solely on global semantics, disregarding the hierarchical semantics of the expression and the image, and struggle with complex, open real-world scenarios, thus failing to capture critical cross-modal information. To address these limitations, this article introduces an innovative mutually-guided hierarchical multi-modal feature learning scheme. Guided by global visual features, the model mines hierarchical text features from different stages of the text encoder; simultaneously, global textual features guide the aggregation of multi-scale visual features. This mutually guided hierarchical feature learning effectively addresses the semantic inaccuracy caused by free-form text and the naturally occurring scale variations. Furthermore, a Segment Detail Refinement (SDR) module is designed to enhance the model's spatial detail awareness through attention mapping of low-level visual features and cross-modal features. To evaluate the effectiveness of the proposed approach, extensive experiments are conducted on three widely used referring image segmentation datasets. The results demonstrate the superiority of the presented method in accurately locating and segmenting objects in images. [ABSTRACT FROM AUTHOR]
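Editor's note: the abstract describes a mutual-guidance mechanism in which a global visual descriptor attends over hierarchical text features while a global textual descriptor attends over multi-scale visual features. The snippet below is a minimal, hypothetical sketch of that general idea using standard cross-attention; the class name, tensor shapes, and fusion by addition are assumptions for illustration and do not reproduce the authors' implementation or the SDR module.

```python
# Hypothetical sketch of mutually-guided cross-modal aggregation (not the authors' code).
import torch
import torch.nn as nn


class MutualGuidance(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Hierarchical text features are queried by the global visual descriptor.
        self.vis_guided_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Multi-scale visual features are queried by the global textual descriptor.
        self.txt_guided_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, global_vis, text_stages, global_txt, vis_scales):
        # global_vis:  (B, 1, C)  pooled image feature
        # text_stages: (B, Lt, C) text features stacked from several encoder stages
        # global_txt:  (B, 1, C)  pooled sentence feature
        # vis_scales:  (B, Lv, C) multi-scale visual features flattened to tokens
        txt_feat, _ = self.vis_guided_text(global_vis, text_stages, text_stages)
        vis_feat, _ = self.txt_guided_vis(global_txt, vis_scales, vis_scales)
        # Fuse the two guided descriptors into a joint cross-modal embedding (assumed fusion).
        return txt_feat + vis_feat


if __name__ == "__main__":
    B, C = 2, 256
    model = MutualGuidance(C)
    out = model(torch.randn(B, 1, C), torch.randn(B, 20, C),
                torch.randn(B, 1, C), torch.randn(B, 1024, C))
    print(out.shape)  # torch.Size([2, 1, 256])
```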
Database:
Complementary Index
External link:
|