Video captioning using transformer network.

Author: Nechikkat, Mubashira I., Pattilikattil, Bhagyasree V., Varma, Soumya, James, Ajay
Subject:
Source: AIP Conference Proceedings; 2022, Vol. 2494 Issue 1, p1-7, 7p
Abstract: Video captioning is the automatic generation of a textual description of a given video. It is a type of sequence-to-sequence translation that must take into account both the spatial and temporal features of the input video. Recurrent Neural Networks arranged in an encoder-decoder architecture are generally used for this type of problem. Models based on Recurrent Neural Networks suffer from high computational cost, and their output generation depends on the final hidden state vector of the encoder, which cannot represent the entire video effectively. Incorporating an attention mechanism into the video captioning process can improve the efficiency of caption generation. Recently, focus has shifted from RNN-based networks to Transformers for the video captioning problem. This paper introduces a transformer-based network architecture, in place of LSTM-based models, for video captioning. This architecture is commonly used in language translation models. The transformer network contains multi-head self-attention and encoder-decoder attention to improve the performance of text description generation. The model achieves a BLEU score of 42.4 on the MSR-VTT dataset and 53.2 on the MSVD dataset, which is better than state-of-the-art models. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
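
The abstract describes an encoder-decoder transformer in which frame features are encoded and a caption is decoded with masked multi-head self-attention plus encoder-decoder attention. The following is a minimal sketch of that general setup, not the authors' implementation; the feature dimension, layer counts, vocabulary size, and class names are assumptions made for illustration.

```python
# Hypothetical sketch of a transformer-based video captioning model (PyTorch).
import torch
import torch.nn as nn


class VideoCaptioningTransformer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=4, max_len=30):
        super().__init__()
        # Project per-frame CNN features (assumed 2048-d) into the model dimension.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Encoder-decoder transformer: the decoder uses masked multi-head
        # self-attention over caption tokens and encoder-decoder attention
        # over the encoded frame sequence.
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, num_frames, feat_dim); caption_tokens: (batch, seq_len)
        src = self.feat_proj(frame_feats)
        positions = torch.arange(caption_tokens.size(1), device=caption_tokens.device)
        tgt = self.token_emb(caption_tokens) + self.pos_emb(positions)
        # Causal mask so each caption position attends only to earlier tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            caption_tokens.size(1)).to(caption_tokens.device)
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits


if __name__ == "__main__":
    model = VideoCaptioningTransformer()
    feats = torch.randn(2, 20, 2048)           # 2 clips, 20 frames of CNN features
    tokens = torch.randint(0, 10000, (2, 12))   # partial captions (teacher forcing)
    print(model(feats, tokens).shape)           # torch.Size([2, 12, 10000])
```

In this kind of setup, training typically uses teacher forcing with cross-entropy over the next caption token, while inference decodes tokens autoregressively; the paper's exact feature extractor and decoding strategy are not specified in the abstract.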