Abstract: |
Deep cross-modal hashing, as a promising fast similarity search technique, has attracted broad interest and achieved great success owing to its outstanding representation capability and computational efficiency. Because the feature representations and distributions of different modalities (i.e., image and text) are inconsistent, prior studies primarily focus on preserving pairwise similarity with global embeddings, but fail to further exploit detailed local representations to effectively align such heterogeneous data and thereby jointly bridge the heterogeneity and semantic gaps across modalities. Meanwhile, typical hashing networks can learn only a single fixed-length hash code rather than codes of multiple lengths, which severely limits flexibility and scalability. To tackle these issues, this paper proposes a novel Contrastive Multi-bit Collaborative Learning (CMCL) network, which hierarchically aligns both global and local features across modalities and simultaneously generates hash codes of multiple lengths (i.e., 16, 32, and 64 bits) in one unified transformer-based framework. Specifically, we design a novel cross-modal contrastive alignment module that simultaneously bridges the heterogeneity and semantic gaps across modalities via global and local contrastive learning. Moreover, we propose a multi-bit collaborative optimization module that synchronously produces multi-length hash codes under the explicit guidance of an auxiliary online hash learner with a longer code length (i.e., 128 bits). As such, our CMCL framework can jointly alleviate the heterogeneity among modalities from a hierarchical perspective and collaboratively explore the correlations among multi-bit hash codes, thereby yielding discriminative hash codes of multiple lengths in a one-stop learning manner. Comprehensive experiments demonstrate the consistent superiority of our CMCL in multi-bit hash code learning over state-of-the-art cross-modal hashing baselines.