Video Captioning via Sentence Augmentation and Spatio-Temporal Attention

Authors: Kuo-Hao Zeng, Tseng-Hung Chen, Wan-Ting Hsu, Min Sun
Year of publication: 2017
Source: Computer Vision – ACCV 2016 Workshops ISBN: 9783319544069
ACCV Workshops (1)
DOI: 10.1007/978-3-319-54407-6_18
Description: Generating video descriptions has many important applications, such as human-robot interaction, video indexing, video summarization, and assisting the visually impaired. Significant breakthroughs in deep learning and the release of large-scale open-domain video description datasets allow us to explore this task more effectively. Recently, Venugopalan et al. (S2VT) proposed to caption a video using techniques from machine translation. We propose a tracklet attention method that captures spatio-temporal information in the decoding phase, while keeping the encoding phase similar to S2VT so as to retain the machine-translation formulation. On the other hand, labels for video captioning are expensive and scarce, and it is difficult for a training corpus to cover all the rare words that appear in the test set. Hence, we propose a sentence augmentation method to enrich our training corpus. Finally, we conduct experiments demonstrating that tracklet attention and sentence augmentation improve the performance of S2VT on the validation set of the Microsoft Research Video to Text dataset (MSR-VTT). In addition, we achieve state-of-the-art performance on the Video Titles in the Wild dataset (VTW) by applying tracklet attention.
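To make the decoding-phase attention concrete, the following is a minimal sketch of soft attention over per-tracklet features at a single decoder step. All names and the dot-product scorer are illustrative assumptions, not the paper's exact formulation (the actual model may use a learned MLP scorer and different feature dimensions):

```python
import numpy as np

def tracklet_attention(decoder_state, tracklet_feats):
    """Hypothetical sketch: soft attention over per-tracklet features.

    decoder_state:  shape (d,), current decoder hidden state
    tracklet_feats: shape (n, d), one feature vector per object tracklet
    Returns the attention-weighted context vector, shape (d,).
    """
    # Alignment scores via dot product (a stand-in for a learned scorer).
    scores = tracklet_feats @ decoder_state              # shape (n,)
    # Numerically stable softmax -> attention weights summing to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: convex combination of tracklet features.
    return weights @ tracklet_feats                      # shape (d,)

# Toy example: 3 tracklets with 4-dimensional features.
rng = np.random.default_rng(0)
ctx = tracklet_attention(rng.standard_normal(4), rng.standard_normal((3, 4)))
print(ctx.shape)
```

At each decoding step, such a context vector would be fed to the decoder alongside the previous word, letting the model attend to different tracklets (and hence different spatio-temporal regions) per generated word.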
Database: OpenAIRE