Learning cross-modal correlations by exploring inter-word semantics and stacked co-attention
| Field | Value |
|---|---|
| Author | Yanbing Liu, Yue Hu, Zengchang Qin, Yuhang Lu, Weifeng Zhang, Jing Yu |
| Year of publication | 2020 |
| Subject | Computer science; Artificial intelligence; Natural language processing; Computer Vision and Pattern Recognition; Feature extraction; Feature (machine learning); Similarity measure; Similarity (psychology); Semantics; Convolutional neural network; Word2vec; Deep learning; Modality (human–computer interaction); Signal Processing; Software |
| Source | Pattern Recognition Letters, 130:189-198 |
| ISSN | 0167-8655 |
| DOI | 10.1016/j.patrec.2018.08.017 |
| Description | Cross-modal information retrieval aims to retrieve heterogeneous data in one modality given a query from another modality. The main challenges are learning the semantic correlations between different modalities and measuring the distance across them. For text-image retrieval, existing work mostly uses an off-the-shelf Convolutional Neural Network (CNN) for image feature extraction, while texts are represented by word-level features such as bag-of-words or word2vec embeddings fed into deep learning models. Besides word-level semantics, the semantic relations between words are also informative but less explored. In this paper, we explore inter-word semantics by modelling texts as graphs, with edges weighted by a word2vec-based similarity measure. Beyond feature representations, we further study the problem of information imbalance between modalities that describe the same semantics; for example, textual descriptions often contain background information that cannot be conveyed by images, and vice versa. We propose a stacked co-attention network that progressively learns the mutually attended features of the two modalities and enhances their fine-grained correlations. A dual-path neural network is proposed for cross-modal information retrieval, trained with a pairwise similarity loss that maximizes the similarity of relevant text-image pairs and minimizes the similarity of irrelevant pairs. Experimental results show that the proposed model significantly outperforms state-of-the-art methods, with a 19% improvement in accuracy in the best case. (Illustrative sketches of the text graph construction and the pairwise loss follow this record.) |
| Database | OpenAIRE |
| External link | |
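
The description above mentions modelling a text as a graph whose edges come from a word2vec-based similarity measure. The paper's exact construction is not given in this record, so the sketch below is only one plausible reading: connect two words of a text when the cosine similarity of their pre-trained word2vec vectors exceeds a threshold. The `word_vecs` dictionary, the `threshold` value, and the function names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def build_text_graph(words, word_vecs, threshold=0.5):
    """Build a weighted adjacency matrix over the words of one text.

    words     -- list of tokens in the text
    word_vecs -- dict mapping a token to its pre-trained word2vec vector (assumed given)
    threshold -- similarity above which two words are connected (illustrative value)
    """
    n = len(words)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if words[i] in word_vecs and words[j] in word_vecs:
                sim = cosine_sim(word_vecs[words[i]], word_vecs[words[j]])
                if sim >= threshold:
                    adj[i, j] = adj[j, i] = sim  # undirected edge weighted by similarity
    return adj
```

The resulting adjacency matrix can then feed a graph-based text encoder, so that inter-word relations, not just individual word features, shape the text representation.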
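The description also states that the dual-path network is trained with a pairwise similarity loss that raises the similarity of relevant text-image pairs and lowers that of irrelevant pairs. Whether the paper uses a margin-based formulation is not stated in this record, so the following is a hedged sketch of one common variant; the `margin` value, cosine scoring, and variable names are assumptions.

```python
import numpy as np

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pairwise_similarity_loss(text_emb, pos_img_emb, neg_img_emb, margin=0.2):
    """Margin-based pairwise loss (illustrative, not necessarily the paper's exact loss).

    Encourages the similarity of a matched text-image pair to exceed the
    similarity of a mismatched pair by at least `margin`; zero loss once satisfied.
    """
    s_pos = cosine_sim(text_emb, pos_img_emb)   # similarity of a relevant pair
    s_neg = cosine_sim(text_emb, neg_img_emb)   # similarity of an irrelevant pair
    return max(0.0, margin - s_pos + s_neg)

# Usage example with random embeddings standing in for the network outputs
rng = np.random.default_rng(0)
t, img_pos, img_neg = rng.normal(size=256), rng.normal(size=256), rng.normal(size=256)
print(pairwise_similarity_loss(t, img_pos, img_neg))
```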