Author:
VIGNESH, C., KUMAR, J. YASWANTH, MAYURANATHAN, M., KUMAR, J. SUNIL
Subject:
Source:
i-Manager's Journal of Pattern Recognition; Jan-Jun 2021, Vol. 8 Issue 1, p19-24, 6p
Abstract:
Providing methods that support audiovisual interaction with growing volumes of video data is an increasingly important challenge for data processing. There has recently been some success in generating lip movements from speech and in generating talking faces. Among these tasks, talking face generation aims to produce realistic talking heads synchronized with audio or text input. The task requires mining the connection between the audio signal/text and the lip-synced video frames while ensuring temporal continuity between frames. Owing to problems such as polysemy, ambiguity, and fuzziness in sentences, creating visual images with accurate lip synchronization remains challenging. This problem is addressed by employing a data mining framework to discover synchronous patterns between different channels in large recorded audio/text and visual datasets, and by applying these patterns to generate realistic talking face animations. Specifically, we decompose the task into two steps: prediction of the muscular movements of the mouth and video synthesis. First, a multimodal learning method is proposed to obtain accurate lip movements during speech from multimedia inputs (both text and audio). In the second step, the Face2Vid framework is used to generate video frames conditioned on the predicted lip movements. This model can translate the language in the audio into a different language and dub the video in the new language with proper lip synchronization. The model uses natural language processing and machine translation (MT) to translate the audio, then applies a generative adversarial network (GAN) and a recurrent neural network (RNN) to achieve proper lip synchronization. [ABSTRACT FROM AUTHOR]
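The abstract describes a two-step pipeline: an RNN-based predictor that maps audio/text features to lip movements, followed by a GAN-style generator that synthesizes frames conditioned on those movements. Below is a minimal sketch of that structure, assuming PyTorch; the class names, feature dimensions, and the omission of the GAN discriminator and training loop are illustrative assumptions, not the authors' Face2Vid implementation.

    # Illustrative sketch only (assumes PyTorch); dimensions and names are hypothetical.
    import torch
    import torch.nn as nn

    class LipMovementPredictor(nn.Module):
        """Step 1: RNN mapping audio/text features to lip-movement (landmark) sequences."""
        def __init__(self, feat_dim=80, hidden_dim=256, landmark_dim=40):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, landmark_dim)

        def forward(self, feats):            # feats: (batch, time, feat_dim)
            h, _ = self.rnn(feats)
            return self.out(h)               # (batch, time, landmark_dim)

    class FrameGenerator(nn.Module):
        """Step 2: generator conditioned on predicted lip movements (GAN generator side only)."""
        def __init__(self, landmark_dim=40, img_dim=64 * 64 * 3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(landmark_dim, 512), nn.ReLU(),
                nn.Linear(512, img_dim), nn.Tanh())

        def forward(self, landmarks):        # landmarks: (batch, time, landmark_dim)
            return self.net(landmarks)       # one flattened frame per time step

    # Usage sketch: features of the translated audio go in, lip-synced frames come out.
    feats = torch.randn(1, 100, 80)          # e.g. 100 time steps of mel-spectrogram features
    landmarks = LipMovementPredictor()(feats)
    frames = FrameGenerator()(landmarks)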
Database:
Complementary Index |
External link: