Popis: |
We present a supervised video decaptioning algorithm driven by encoder-decoder pixel prediction. By analogy with auto-encoders, we use a U-Net with stacked dilated convolution layers, a convolutional neural network trained to generate the decaptioned version of an arbitrary video with subtitles of any size, colour, or background. Moreover, our method does not require a mask of the text region to be removed. To succeed at this task, our model must both understand the content of entire video frames and produce a visually plausible hypothesis for the missing content behind the text overlay. When training our model, we experimented with both a standard pixel-wise reconstruction loss and a total variation loss. The latter produces much sharper results because it enforces local consistency in the generated frames. We found that our model learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of including dilated convolution layers and residual connections in the bottleneck for reconstructing videos without captions. Furthermore, our model can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
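Below is a minimal sketch (not the authors' released code) of the two components the description highlights: a bottleneck of stacked dilated convolutions with residual connections, and a total variation term added to a pixel-wise reconstruction loss. PyTorch, the channel count, the dilation rates, the choice of L1 for the reconstruction term, and the weight tv_weight are illustrative assumptions not stated in the original.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResidualBlock(nn.Module):
    """Dilated 3x3 convolution with a residual (skip) connection."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Residual connection around the dilated convolution.
        return x + F.relu(self.norm(self.conv(x)))


class DilatedBottleneck(nn.Module):
    """Stack of dilated residual blocks placed between a U-Net encoder and decoder."""

    def __init__(self, channels: int = 256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.blocks = nn.Sequential(
            *[DilatedResidualBlock(channels, d) for d in dilations])

    def forward(self, x):
        return self.blocks(x)


def total_variation(x):
    """Anisotropic total variation: mean absolute difference between
    neighbouring pixels, encouraging locally consistent output frames."""
    dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return dh + dw


def decaptioning_loss(pred, target, tv_weight=1e-4):
    """Pixel-wise reconstruction (L1, an assumption) plus a weighted total variation term."""
    return F.l1_loss(pred, target) + tv_weight * total_variation(pred)

The growing dilation rates enlarge the receptive field of the bottleneck without downsampling, which is one plausible way to let the model reason about frame content far from the text overlay while the residual connections keep gradients stable.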