Masked Vision-language Transformer in Fashion

Author:	Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool
Publication year:	2023
Subject:	FOS: Computer and information sciences; Computer Science - Computation and Language; Computer Vision; Transformers; Vision and language; Masked image reconstruction; Fashion; Computer Vision and Pattern Recognition (cs.CV); Computer Science - Computer Vision and Pattern Recognition; Data processing, computer science; ddc:004; Computation and Language (cs.CL)
Source:	Machine Intelligence Research, 20 (3)
ISSN:	2731-5398 2731-538X
Description:	We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
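Illustrative note:	The masked image reconstruction (MIR) idea named in the description can be sketched in a few lines: split an image into patches, hide some of them, and score a reconstruction only on the hidden patches. This is a minimal sketch and not the authors' implementation. The patch size, the mask ratio, the zero placeholder, the mean-absolute-error loss, and every function name below are assumptions made for illustration; the record does not specify any of them.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def choose_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide (mask_ratio is an assumed hyperparameter)."""
    count = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), count))


def apply_mask(patches, masked, fill=0.0):
    """Replace hidden patches with a constant placeholder (an assumed choice)."""
    hidden = set(masked)
    return [[fill] * len(p) if i in hidden else list(p)
            for i, p in enumerate(patches)]


def reconstruction_loss(predicted, target, masked):
    """Mean absolute error over the hidden patches only (an assumed loss)."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


# Toy 4 x 4 single-channel "image" standing in for a fashion photo.
rng = random.Random(0)
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
masked = choose_masked(len(patches), mask_ratio=0.5, rng=rng)
corrupted = apply_mask(patches, masked)

# A model that just echoes its corrupted input is penalised on hidden patches,
# while a perfect reconstruction has zero loss.
loss_untrained = reconstruction_loss(corrupted, patches, masked)
loss_perfect = reconstruction_loss(patches, patches, masked)
```

In a real pre-training setup, a transformer would take the corrupted patches together with the text tokens and predict the hidden patches; the toy code above only shows the masking and the masked-only loss.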
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3e306a08827b7c465dd657726484ad10 https://doi.org/10.1007/s11633-022-1394-4