Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
School of Electrical Engineering, KAIST, South Korea
Notes
- For [Ours] and [ParaLip], we used ground-truth text-audio alignment information so that the generated videos can be compared against the real speech audio.
- For [Wav2Lip with TTS] and [PC-AVS with TTS], we also used the ground-truth text-audio alignment information to synthesize speech, and the synthesized speech was then used as the input to each audio-driven talking face synthesis model. Because of the domain gap between real and synthesized audio, artifacts appear, especially during silent segments. A minimal sketch of both pipelines follows these notes.
- For all methods, the real speech audio is attached to the samples for reference.
- Compared to [ParaLip], the samples from [Ours] show more dynamic and plausible lip movements, thanks to the expressiveness of the frozen audio-driven model. See GRID samples (2, 4) and LRS2 samples (1, 2, 4).
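The sketch below contrasts the two evaluation pipelines described in the notes: the cascaded baselines synthesize speech from text first, while the text-driven methods feed text (with alignment) to the face model directly. Every function here is a hypothetical stand-in, not the actual API of Wav2Lip, PC-AVS, ParaLip, or this paper's released code.

```python
# Hedged sketch of the two pipelines compared above. All functions are
# hypothetical stubs, not APIs from any of the released projects.

def tts_synthesize(text, alignment):
    """Stub: a TTS system that consumes ground-truth text-audio alignment."""
    raise NotImplementedError

def audio_driven_generate(face_frames, audio):
    """Stub: a frozen audio-driven talking face model (e.g. Wav2Lip, PC-AVS)."""
    raise NotImplementedError

def text_driven_generate(face_frames, text, alignment):
    """Stub: a text-driven model ([Ours] or [ParaLip])."""
    raise NotImplementedError

def cascaded_baseline(face_frames, text, alignment):
    # [Wav2Lip with TTS] / [PC-AVS with TTS]: text -> synthesized speech ->
    # frozen audio-driven model. The real-vs-synthesized audio domain gap is
    # what causes the artifacts on silent segments noted above.
    audio = tts_synthesize(text, alignment)
    return audio_driven_generate(face_frames, audio)

def text_driven(face_frames, text, alignment):
    # [Ours] / [ParaLip]: text and alignment drive the face model directly,
    # so no synthesized audio enters the pipeline.
    return text_driven_generate(face_frames, text, alignment)
```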
<Samples from GRID Dataset>
[Video grid: samples 1–5, each compared across Ground Truth, Ours, ParaLip, Wav2Lip with TTS, and PC-AVS with TTS.]
<Samples from LRS2 Dataset>
[Video grid: samples 1–5, each compared across Ground Truth, Ours, ParaLip, Wav2Lip with TTS, and PC-AVS with TTS.]
<Samples using PC-AVS from LRS2 Dataset>
* Generated outputs were composited onto the ground-truth video. Because the source image is taken only from the first frame of each video, all methods show noticeable boundaries. A compositing sketch follows these notes.
* [PC-AVS] is an audio-driven method, while [Ours (on PC-AVS)] and [PC-AVS with TTS] are text-driven methods.
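The boundary mentioned above comes from the compositing step: only a generated face region is pasted onto each ground-truth frame, so a seam remains around that region. Below is a minimal sketch of such hard pasting, assuming the crop box is known; the function and variable names are illustrative, not taken from the PC-AVS code.

```python
import numpy as np

def composite(gt_frame: np.ndarray, gen_crop: np.ndarray, box: tuple) -> np.ndarray:
    """Paste a generated face crop onto a ground-truth frame.

    `box` = (top, left) of the crop region; both arrays are H x W x 3 uint8.
    Hard pasting with no blending is what leaves the visible boundary
    around the face region mentioned in the notes above.
    """
    top, left = box
    h, w = gen_crop.shape[:2]
    out = gt_frame.copy()
    out[top:top + h, left:left + w] = gen_crop
    return out
```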
[Video grid: samples 1–5, each compared across Ground Truth, PC-AVS, Ours (on PC-AVS), and PC-AVS with TTS.]