lip2speech-unit

Intelligible Lip-to-Speech Synthesis with Speech Units (Interspeech 2023)

Jeongsoo Choi, Minsu Kim, Yong Man Ro

[Paper] [Code]

Abstract

In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of previous L2S models, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target for the L2S model. Therefore, the proposed L2S model is trained to generate multiple targets: the mel-spectrogram and the speech units. As the speech units are discrete while the mel-spectrogram is continuous, the proposed multi-target L2S model can be trained with strong content supervision, without using text-labeled data. Moreover, to accurately convert the synthesized mel-spectrogram into a waveform, we introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units. Extensive experimental results confirm the effectiveness of the proposed method in L2S.
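
The abstract describes a multi-target objective: a regression loss on the continuous mel-spectrogram combined with a classification loss on the discrete speech units. The sketch below illustrates, in PyTorch, what such a combined loss could look like; the function name, tensor shapes, unit vocabulary size, and loss weighting are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a multi-target L2S objective:
# L1 regression on the predicted mel-spectrogram plus cross-entropy on the
# predicted discrete speech units. Shapes and the unit vocabulary size are assumptions.
import torch
import torch.nn.functional as F


def multi_target_l2s_loss(pred_mel, target_mel, unit_logits, target_units, unit_weight=1.0):
    """pred_mel/target_mel: (B, T, n_mels); unit_logits: (B, T, n_units); target_units: (B, T) long."""
    mel_loss = F.l1_loss(pred_mel, target_mel)        # continuous target (mel-spectrogram)
    unit_loss = F.cross_entropy(                      # discrete target (speech units)
        unit_logits.transpose(1, 2),                  # cross_entropy expects (B, C, T)
        target_units,
    )
    return mel_loss + unit_weight * unit_loss


# Toy usage with dummy tensors (all sizes are placeholders)
B, T, n_mels, n_units = 2, 50, 80, 200
loss = multi_target_l2s_loss(
    pred_mel=torch.randn(B, T, n_mels),
    target_mel=torch.randn(B, T, n_mels),
    unit_logits=torch.randn(B, T, n_units),
    target_units=torch.randint(0, n_units, (B, T)),
)
```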

Random samples from LRS3 Dataset

Each sample on the demo page compares the silent input video, the ground-truth audio, our models (Proposed + AV-HuBERT, Proposed, and Proposed w/o aug), and the Multi-Task, SVTS, and VCA-GAN methods. The reference text of each sample is listed below.
they are the basis of every action that you take
we don't trust the man

otherwise millions more will die
and then something falls off the wall
and that's powerful
so the answer to the second question can we change
they want to be part of it
we paid two to three times more than anybody else
and that's why it's been a pleasure speaking to you
government officials are extremely mad
but you know what
thank you very much
beth israel's in boston
but we're not there yet
and they don't need to ask for permission
how much they do
so what can we do
they were wonderful people
how do you change your behavior
he could come over and help me