Autor: |
SHIQI SUN, DANLAN HUANG, XIAOMING TAO, CHENGKANG PAN, GUANGYI LIU, CHANGWEN CHEN |
Předmět: |
|
Zdroj: |
ACM Transactions on Multimedia Computing, Communications & Applications; Feb2024, Vol. 20 Issue 2, p1-24, 24p |
Abstrakt: |
Scene graph generation (SGG) has been developed to detect objects and their relationships from the visual data and has attracted increasing attention in recent years. Existing works have focused on extracting object context for SGG. However, very few works have attempted to exploit implicit contextual correlations among relationships of the objects. Furthermore, most existing SGG schemes rely on high-level features to predict the predicates while overlooking the potential inherent association of low-level features with the object relationships. We present in this article a novel scheme to capture enhanced contextual information for both objects and relationships. We design a Dual-branch Context Analysis Transformer (DCAT) architecture to extract both object context and relationship context from the visual data with dual transformer branches and then effectively fuse both high-level and low-level features by an adaptive approach to facilitate relationship prediction. Specifically, we first conduct feature representation learning to enrich relation representations by the visual, spatial, and linguistic feature extractors. Next, two transformer branches are designed to leverage the modeling of global associative interaction and mine the hidden association among objects and relationships. Then, we devise a novel feature disentangling method to decouple contextualized high-level features with guidance from the visual semantics. Finally, we develop a refined attention module to perform low-level feature recalibration for the refinement of the final predicate prediction. Experiments on Visual Genome and Action Genome datasets demonstrate the effectiveness of DCAT for both image and video SGG settings. Moreover, we also test the quality of the generated image scene graphs to verify the generalizability on downstream tasks like sentence-to-graph retrieval and image retrieval. [ABSTRACT FROM AUTHOR] |
Databáze: |
Complementary Index |
Externí odkaz: |
|