Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning
Author: Min Yang, Chengming Li, Junhao Liu, Ruifeng Xu
Year of publication: 2021
Subject: artificial neural network; computer science; feature extraction; semantics; machine learning; data modeling; benchmark (computing); task analysis; artificial intelligence; image processing; electrical and electronic engineering
Source: IEEE Transactions on Circuits and Systems for Video Technology, 31:3242-3253
ISSN: 1558-2205, 1051-8215
Abstract: Cross-modal image-text retrieval is a challenging task that requires a multimedia system to bridge the heterogeneity gap between modalities. In this paper, we take full advantage of image-to-text and text-to-image generation models to improve the performance of the cross-modal image-text retrieval model, incorporating the text-grounded and image-grounded generative features into the cross-modal common space with a "Two-Teacher One-Student" learning framework. In addition, a dual regularizer network is designed to distinguish mismatched image-text pairs from matched ones. In this way, we can capture the fine-grained correspondence between modalities and distinguish the best-retrieved result from a candidate set. Extensive experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, and MS COCO) show that our model achieves state-of-the-art cross-modal retrieval results. In particular, our model improves image-to-text and text-to-image retrieval accuracy by more than 22% over the best competitors on the MS COCO dataset.
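The two key ideas in the abstract, a "Two-Teacher One-Student" distillation of image-grounded and text-grounded generative features into a common student embedding, and a dual regularizer that separates mismatched image-text pairs from matched ones, can be sketched as two loss terms. The functions below are illustrative assumptions (weighted mean-squared distillation plus a hinge-style margin loss), not the paper's actual formulation.

```python
# Illustrative sketch, not the paper's exact losses: a student embedding is
# distilled toward two "teacher" generative features, and a margin loss
# pushes mismatched image-text pairs below matched ones.

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def two_teacher_loss(student, teacher_img, teacher_txt, alpha=0.5):
    """Assumed distillation: pull the student embedding toward both the
    image-grounded and text-grounded teacher features, weighted by alpha."""
    return alpha * mse(student, teacher_img) + (1 - alpha) * mse(student, teacher_txt)

def mismatch_margin_loss(sim_matched, sim_mismatched, margin=0.2):
    """Assumed regularizer: a matched pair's similarity should exceed a
    mismatched pair's by at least `margin` (hinge loss)."""
    return max(0.0, margin - sim_matched + sim_mismatched)
```

For example, a student embedding halfway between two disagreeing teachers pays an equal penalty toward each, and the margin loss vanishes once a matched pair outscores a mismatched one by the margin.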
Database: OpenAIRE
External link: