Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models


Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
School of Electrical Engineering, KAIST, South Korea

Notes

  • For [Ours] and [ParaLip], we used the ground-truth text-audio alignment information so that the generated videos can be compared directly with the real speech audio.
  • For [Wav2Lip with TTS] and [PC-AVS with TTS], we also used the ground-truth text-audio alignment information to synthesize speech, and the synthesized speech was then fed to each audio-driven talking face synthesis model (see the sketch after this list). Because of the domain gap between real and synthesized audio, artifacts appear, especially in silent segments.
  • For all methods, the real speech audio was included as a reference.
  • Compared to [ParaLip], the samples from [Ours] show more dynamic and plausible lip movements thanks to the expressiveness of the frozen audio-driven model. See GRID samples (2, 4) and LRS2 samples (1, 2, 4).
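
The [Wav2Lip with TTS] and [PC-AVS with TTS] baselines follow a simple cascade: text is first converted to speech by a TTS model (constrained by the ground-truth text-audio alignment), and the synthesized speech is then fed to the frozen audio-driven talking face model. A minimal sketch of this cascade is shown below; the tts_model and talking_face_model interfaces are hypothetical placeholders, not the actual APIs of Wav2Lip, PC-AVS, or the TTS system used here.

    # Minimal sketch of the TTS-cascade baselines ([Wav2Lip with TTS], [PC-AVS with TTS]).
    # `tts_model` and `talking_face_model` are hypothetical callables standing in for
    # the real models; their interfaces are assumptions made for illustration only.
    import torch

    def synthesize_with_tts(text, face_frames, tts_model, talking_face_model,
                            alignment=None):
        """Drive a frozen audio-driven talking face model with TTS speech."""
        with torch.no_grad():
            # 1) Synthesize speech from the input text; the ground-truth
            #    text-audio alignment constrains the timing so the synthesized
            #    speech matches the duration of the real utterance.
            speech = tts_model(text, alignment=alignment)
            # 2) Feed the synthesized speech to the frozen audio-driven model.
            #    The domain gap between real and TTS audio is what causes the
            #    artifacts noted above, especially in silent segments.
            video = talking_face_model(face_frames, speech)
        return video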

<Samples from GRID Dataset>


[Video samples 1-5: Ground Truth vs. Ours vs. ParaLip vs. Wav2Lip with TTS vs. PC-AVS with TTS]

<Samples from LRS2 Dataset>


[Video samples 1-5: Ground Truth vs. Ours vs. ParaLip vs. Wav2Lip with TTS vs. PC-AVS with TTS]

<Samples using PC-AVS from LRS2 Dataset>


* Generated outputs are attached to the ground-truth video. Because the source image is taken only from the first frame of each video, all methods show noticeable boundaries.

* [PC-AVS] is an audio-driven method, while [Ours (on PC-AVS)] and [PC-AVS with TTS] are text-driven methods.

[Video samples 1-5: Ground Truth vs. PC-AVS vs. Ours (on PC-AVS) vs. PC-AVS with TTS]