Video DeCaptioning Using U-Net with Stacked Dilated Convolutional Layers

Authors: Shivansh Mundra, Arnav Kumar Jain, Sayan Sinha
Year of publication: 2019
Source: Inpainting and Denoising Challenges, ISBN: 9783030256135
DOI: 10.1007/978-3-030-25614-2_6
Description: We present a supervised video decaptioning algorithm driven by encoder-decoder pixel prediction. By analogy with auto-encoders, we use a U-Net with stacked dilated convolution layers, a convolutional neural network trained to generate the decaptioned version of an arbitrary video with subtitles of any size, colour, or background. Moreover, our method does not require a mask of the text region to be removed. To succeed at this task, our model needs both to understand the content of entire video frames and to produce a visually appealing hypothesis for the missing region behind the text overlay. When training our model, we experimented with both a standard pixel-wise reconstruction loss and a total variation loss. The latter produces much sharper results because it enforces the inherent local structure of the generated image. We found that our model learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of including dilated convolution layers and residual connections in the bottleneck layer (sketched below) for reconstructing videos without captions. Furthermore, our model can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Database: OpenAIRE
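
The abstract's two concrete technical ingredients are a bottleneck of stacked dilated convolutions with residual connections and a total variation loss term. Below is a minimal sketch of what these could look like; the choice of PyTorch, the channel widths, the dilation rates, and the `tv_weight` coefficient are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class DilatedResidualBlock(nn.Module):
    """A 3x3 convolution with a given dilation, wrapped in a residual connection."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv(x))  # residual connection


class DilatedBottleneck(nn.Module):
    """Stack of dilated residual blocks with a growing receptive field,
    intended to sit at the bottom of a U-Net encoder-decoder."""

    def __init__(self, channels: int = 256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.blocks = nn.Sequential(
            *[DilatedResidualBlock(channels, d) for d in dilations])

    def forward(self, x):
        return self.blocks(x)


def total_variation_loss(img: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation: mean absolute difference between
    neighbouring pixels, encouraging locally coherent output."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw


def decaptioning_loss(pred, target, tv_weight: float = 1e-4):
    """Pixel-wise reconstruction plus TV regularisation; `tv_weight` is a
    hypothetical hyperparameter, not a value reported by the authors."""
    return nn.functional.l1_loss(pred, target) + tv_weight * total_variation_loss(pred)
```

In this sketch, increasing dilation rates enlarge the receptive field without pooling, which is one plausible way to let the bottleneck aggregate frame-wide context while the residual connections keep gradients flowing through the stacked layers.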