Masked Vision-language Transformer in Fashion

Authors: Ji, Ge-Peng; Zhuge, Mingchen; Gao, Dehong; Fan, Deng-Ping; Sakaridis, Christos; Van Gool, Luc
Source: Machine Intelligence Research, June 2023, Vol. 20, Issue 3, pp. 421-434 (14 pages)
Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Database: Supplemental Index