In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronized video from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, in which we can also enforce video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking, which dynamically locates the mask on each frame using a 3D parametric mesh predictor, improving naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual details. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and enhance the characteristics of an unseen face using only a few seconds of target video through the proposed adaptation method.
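As a rough illustration of how video consistency can be obtained with a linear transformation over per-frame StyleGAN latent codes, here is a minimal sketch. The moving-average matrix below is an illustrative stand-in, not the paper's actual transformation; shapes and the latent dimension are assumptions.

```python
# Minimal sketch: enforcing temporal consistency by applying a linear
# transformation to per-frame StyleGAN latent codes. The banded averaging
# matrix A is illustrative; the paper's actual linear map may differ.
import torch

def smooth_latents(w, kernel_size=5):
    """w: (T, 512) per-frame latent codes in StyleGAN's W space (assumed shape)."""
    T, _ = w.shape
    half = kernel_size // 2
    # Build a (T, T) averaging matrix so that w_smooth = A @ w is linear in w.
    A = torch.zeros(T, T)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        A[t, lo:hi] = 1.0 / (hi - lo)
    return A @ w

w = torch.randn(16, 512)          # 16 frames of hypothetical latent codes
w_consistent = smooth_latents(w)  # temporally smoothed codes fed to the generator
```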
Figure 1. Overview of StyleLipSync
StyleLipSync generates lip-sync videos with highly accurate sync quality. Here, we show zero-shot comparison results for reconstruction (VoxCeleb2 test set) and cross-identity generation (HDTF).
Video 1. Reconstruction results on VoxCeleb2
Video 2. Cross-identity result on HDTF
We propose pose-aware masking, which utilizes a 3D parametric facial model to make the model aware of head pose without any external pose encoding.
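The sketch below illustrates the idea of a pose-aware mask: a per-frame mouth region is located from projected vertices of a 3D parametric face model and masked out before generation. The predictor interface and margin value are hypothetical.

```python
# Minimal sketch of pose-aware masking, assuming per-frame 2D projections of
# mouth-region vertices from a 3D parametric face model. Names are illustrative.
import numpy as np

def pose_aware_mask(frame, mouth_vertices_2d, margin=0.15):
    """frame: (H, W, 3); mouth_vertices_2d: (N, 2) projected mouth-region vertices."""
    H, W = frame.shape[:2]
    x_min, y_min = mouth_vertices_2d.min(axis=0)
    x_max, y_max = mouth_vertices_2d.max(axis=0)
    # Expand the box slightly so the mask fully covers the mouth under each pose.
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    x0, x1 = int(max(0, x_min - dx)), int(min(W, x_max + dx))
    y0, y1 = int(max(0, y_min - dy)), int(min(H, y_max + dy))
    mask = np.ones((H, W), dtype=np.float32)
    mask[y0:y1, x0:x1] = 0.0          # zero out the pose-aligned mouth region
    return frame * mask[..., None], mask
```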
Video 4. Ablation study (1)
Video 5. Ablation study (2)
We additionally propose a few-shot single-person adaptation method that requires only a few seconds of target-person video. While preserving the lip-sync accuracy of the zero-shot model, it enhances person-specific visual details, such as the shape of the lips and teeth.
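The following is a minimal sketch of one adaptation step with a sync regularizer, assuming a pre-trained generator, a frozen SyncNet-style sync expert, and a few seconds of target-person frames and audio. All module names and the loss weight are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of few-shot adaptation: a reconstruction loss captures
# person-specific appearance, while a sync regularizer (frozen expert) keeps
# lip-sync quality from degrading during fine-tuning. Names are illustrative.
import torch
import torch.nn.functional as F

def adaptation_step(generator, sync_expert, frames, audio, optimizer, lam=0.1):
    generator.train()
    pred = generator(frames, audio)        # lip-synced frames for the target person
    rec_loss = F.l1_loss(pred, frames)     # person-specific reconstruction
    # Sync regularizer: maximize the frozen expert's audio-visual sync score so
    # fine-tuning on one identity preserves lip-sync generalization.
    sync_loss = -sync_expert(pred, audio).mean()
    loss = rec_loss + lam * sync_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```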
Video 6. Zero-shot vs Adaptation (1)
Video 7. Zero-shot vs Adaptation (2)
Video 8. Visualization of Style-aware Masked Fusion
@InProceedings{Ki_2023_ICCV,
    author    = {Ki, Taekyung and Min, Dongchan},
    title     = {StyleLipSync: Style-based Personalized Lip-sync Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22841-22850}
}