Autor: |
Zhong, Jingyue, Cao, Yang, Zhu, Yina, Gong, Jie, Chen, Qiaosen |
Předmět: |
|
Zdroj: |
Visual Computer; Dec2023, Vol. 39 Issue 12, p6115-6132, 18p |
Abstrakt: |
Automatically describing the detail and content of the image is a meaningful but difficult task. In this paper, we propose a variety of optimization improvements to enhance the encoder and decoder for image captioning, called multi-channel weighted fusion. In the presented model, we propose multi-channel encoder which is able to extract different features of the same image by combining various models and algorithms. In order to avoid dimensional explosion caused by multi-channel encoder, we employ the reducing multilayer perceptron to reduce the dimension and discuss how to train the reducing multilayer perceptron. For the decoder part, we discuss how the decoder receives features from different channels and propose a technique for fusing independent and identically typed decoders. To get a better description generated by the decoder, we exploit the voting weight strategy for decoder fusion and explore the entropy function to choose the best distribution. The experiment on datasets Flickr-8k, Flickr-30k and MS COCO demonstrates that the proposed model is compatible with most features with low error rate. For instance, our model is specifically outstanding on METEOR score. [ABSTRACT FROM AUTHOR] |
Databáze: |
Complementary Index |
Externí odkaz: |
|