Abstract: |
Text-to-video synthesis has attracted significant attention as a challenging task in computer vision. Advances in unsupervised learning have made it more feasible, and Generative Adversarial Network (GAN)-based models have emerged as the leading unsupervised deep learning methods, showing promising results. However, achieving visual quality, temporal coherence, and semantic consistency between the generated video and its textual description remains a considerable challenge. In this paper, we propose a GAN for text-to-video synthesis built on a novel Video-Text Matcher (VTM). The VTM is a modified version of Contrastive Language-Image Pre-training (CLIP) that uses both global sentence-level and fine-grained word-level information to compute the similarity between the generated video and the given textual description. Unlike CLIP, which applies a matching loss only at the global sentence-image level, our VTM adds a word-region-level loss to improve fine-grained consistency between text and video. We evaluate the approach on the Single Digit Bouncing MNIST GIFs (SBMG) dataset with both qualitative and quantitative analyses. The results show that the method generates visually appealing videos that align well with the given textual descriptions, demonstrating its effectiveness for text-to-video synthesis. |