Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data

Authors: Ben Milner, Danny Websdale, Sarah Taylor
Year of publication: 2022
Subject:
Source: IEEE Transactions on Multimedia. 24:2539-2552
ISSN: 1941-0077, 1520-9210
Description: We propose a real-time speaker-independent speech-to-facial animation system that predicts lip and jaw movements on a reference face for audio speech taken from any speaker. Our approach is motivated by two key observations: 1) Speaker-independent facial animation can be generated from phoneme labels, but performing this automatically requires a speech recogniser which, due to contextual look-ahead, introduces too much time lag. 2) Audio-driven speech animation can be performed in real time but requires large, multi-speaker audio-visual speech datasets, of which there are few. We adopt a novel three-stage training procedure that leverages the advantages of each approach. First, we train a phoneme-to-visual speech model from a large single-speaker audio-visual dataset. Next, we use this model to generate the synthetic visual component of a large multi-speaker audio dataset for which video is not available. Finally, we learn an audio-to-visual speech mapping using the synthetic visual features as the target. Furthermore, we increase the realism of the predicted facial animation by introducing two perceptually based loss functions that aim to improve mouth closures and openings. The proposed method and loss functions are evaluated objectively using mean square error, global variance and a new metric that measures the extent of mouth opening. Subjective tests are carried out over the best performing systems. Results show that our approach produces audio-driven facial animation that is comparable to that produced from phoneme sequences, and that improved mouth closures, particularly for bilabial closures, are achieved.
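
The abstract does not give the exact form of the two perceptually based loss functions, so the following is only a minimal sketch of the general idea: a standard mean square error on the visual features combined with an illustrative mouth-closure penalty derived from lip aperture. The landmark indices, threshold, weighting and feature layout are all assumptions for illustration, not the authors' formulation.

```python
# Illustrative sketch only: MSE on visual features plus an assumed
# "mouth closure" penalty computed from lip aperture.
import numpy as np

def lip_aperture(frames, upper_lip_idx=0, lower_lip_idx=1):
    """Vertical distance between an assumed upper- and lower-lip landmark.
    frames has shape (T, num_landmarks, 2) with (x, y) per landmark."""
    return np.abs(frames[:, upper_lip_idx, 1] - frames[:, lower_lip_idx, 1])

def mse_loss(pred, target):
    """Standard mean square error over all frames and feature dimensions."""
    return np.mean((pred - target) ** 2)

def closure_loss(pred, target, closed_thresh=0.05):
    """Assumed perceptual term: penalise frames where the reference mouth is
    (nearly) closed, e.g. during bilabials, but the prediction stays open."""
    pred_ap = lip_aperture(pred)
    tgt_ap = lip_aperture(target)
    closed = tgt_ap < closed_thresh          # frames that should be closed
    if not np.any(closed):
        return 0.0
    return np.mean(np.maximum(pred_ap[closed] - tgt_ap[closed], 0.0) ** 2)

def total_loss(pred, target, lam=1.0):
    """Weighted sum of the reconstruction loss and the closure penalty."""
    return mse_loss(pred, target) + lam * closure_loss(pred, target)

# Toy usage with random landmark trajectories (100 frames, 20 landmarks, x/y).
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 20, 2))
pred = target + 0.1 * rng.normal(size=target.shape)
print(total_loss(pred, target))
```

A complementary mouth-opening term could be built the same way by penalising under-opening on frames where the reference aperture exceeds a threshold; the paper's evaluation metric for the extent of mouth opening presumably tracks a similar aperture statistic, though its exact definition is not given in this record.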
Database: OpenAIRE