Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
School of Electrical Engineering, KAIST, South Korea
Notes
- For [Ours] and [ParaLip], we used ground-truth text-audio alignment information so that the generated videos can be compared against the real speech audio.
- For [Wav2Lip with TTS] and [PC-AVS with TTS], we also used the ground-truth text-audio alignment information to synthesize speech, and the synthesized speech was then used as the input to each audio-driven talking face synthesis model. Because of the domain gap between real and synthesized audio, artifacts appear, especially during silent segments. A minimal sketch of both pipelines follows these notes.
- For all methods, the real speech audio is attached to the samples for reference.
- Compared to [ParaLip], the samples from [Ours] show more dynamic and plausible lip movements, thanks to the expressiveness of the frozen audio-driven model. See GRID samples (2, 4) and LRS2 samples (1, 2, 4).
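The sketch below contrasts the two evaluation pipelines described in the notes: the cascaded baselines synthesize speech from text first, while the text-driven methods feed text (with alignment) to the face model directly. Every function here is a hypothetical stand-in, not the actual API of Wav2Lip, PC-AVS, ParaLip, or this paper's released code.

```python
# Hedged sketch of the two pipelines compared above. All functions are
# hypothetical stubs, not APIs from any of the released projects.

def tts_synthesize(text, alignment):
    """Stub: a TTS system that consumes ground-truth text-audio alignment."""
    raise NotImplementedError

def audio_driven_generate(face_frames, audio):
    """Stub: a frozen audio-driven talking face model (e.g. Wav2Lip, PC-AVS)."""
    raise NotImplementedError

def text_driven_generate(face_frames, text, alignment):
    """Stub: a text-driven model ([Ours] or [ParaLip])."""
    raise NotImplementedError

def cascaded_baseline(face_frames, text, alignment):
    # [Wav2Lip with TTS] / [PC-AVS with TTS]: text -> synthesized speech ->
    # frozen audio-driven model. The real-vs-synthesized audio domain gap is
    # what causes the artifacts on silent segments noted above.
    audio = tts_synthesize(text, alignment)
    return audio_driven_generate(face_frames, audio)

def text_driven(face_frames, text, alignment):
    # [Ours] / [ParaLip]: text and alignment drive the face model directly,
    # so no synthesized audio enters the pipeline.
    return text_driven_generate(face_frames, text, alignment)
```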
<Samples from GRID Dataset>
[Video grid: samples 1–5, each compared across Ground Truth, Ours, ParaLip, Wav2Lip with TTS, and PC-AVS with TTS.]
<Samples from LRS2 Dataset>
[Video grid: samples 1–5, each compared across Ground Truth, Ours, ParaLip, Wav2Lip with TTS, and PC-AVS with TTS.]
<Samples using PC-AVS from LRS2 Dataset>
* Generated outputs were composited onto the ground-truth video. Because the source image is taken only from the first frame of each video, all methods show noticeable boundaries. A compositing sketch follows these notes.
* [PC-AVS] is an audio-driven method, while [Ours (on PC-AVS)] and [PC-AVS with TTS] are text-driven methods.
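The boundary mentioned above comes from the compositing step: only a generated face region is pasted onto each ground-truth frame, so a seam remains around that region. Below is a minimal sketch of such hard pasting, assuming the crop box is known; the function and variable names are illustrative, not taken from the PC-AVS code.

```python
import numpy as np

def composite(gt_frame: np.ndarray, gen_crop: np.ndarray, box: tuple) -> np.ndarray:
    """Paste a generated face crop onto a ground-truth frame.

    `box` = (top, left) of the crop region; both arrays are H x W x 3 uint8.
    Hard pasting with no blending is what leaves the visible boundary
    around the face region mentioned in the notes above.
    """
    top, left = box
    h, w = gen_crop.shape[:2]
    out = gt_frame.copy()
    out[top:top + h, left:left + w] = gen_crop
    return out
```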
[Video grid: samples 1–5, each compared across Ground Truth, PC-AVS, Ours (on PC-AVS), and PC-AVS with TTS.]