StyleLipSync: Style-based Personalized Lip-sync Video Generation

1AITRICS, 2Graduate School of AI, KAIST
*Equal Contribution
ICCV 2023


In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronized video from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also enforce video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve naturalness over frames, utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and can enhance the characteristics of an unseen face using only a few seconds of target video through the proposed adaptation method.

Figure 1. Overview of StyleLipSync

Comparison with Other Methods

StyleLipSync can generate lip-sync video with highly accurate sync quality. Here, we present zero-shot comparison results for reconstruction (VoxCeleb2 test set) and cross-identity generation (HDTF).

Video 1. Reconstruction results on VoxCeleb2

Video 2. Cross-identity result on HDTF

Ablation Studies

Pose-aware Masking

We propose pose-aware masking by utilizing a 3D parametric facial model, which makes the model aware of pose information without an external pose encoding.
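The idea can be sketched as follows: because the masked region is derived from mouth vertices of a fitted 3D face mesh projected into the frame, the mask moves with the head pose automatically. This is a minimal illustrative re-implementation, not the paper's exact procedure; the landmark input and margin are assumptions.

```python
import numpy as np

def pose_aware_mask(mouth_landmarks_2d, frame_shape, margin=0.15):
    """Build a binary mask around projected mouth landmarks for one frame.

    mouth_landmarks_2d: (N, 2) array of mouth-region vertices projected
    from a 3D parametric face mesh (e.g. a 3DMM fit) -- hypothetical input.
    frame_shape: (H, W) of the video frame.
    The mask tracks the head pose because the projected vertices move
    with it, so no separate pose encoding is required.
    """
    h, w = frame_shape
    x0, y0 = mouth_landmarks_2d.min(axis=0)
    x1, y1 = mouth_landmarks_2d.max(axis=0)
    # Expand the bounding box by a relative margin so the jaw is covered.
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    x0, x1 = max(0, int(x0 - mx)), min(w, int(x1 + mx))
    y0, y1 = max(0, int(y0 - my)), min(h, int(y1 + my))
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask
```

Running this per frame yields a mask trajectory that follows the speaker's head motion, which is what lets the generator inpaint the mouth region consistently across poses.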

Video 3. Example of pose-aware masking

Architectural Studies

StyleLipSync contains Moving-average based Latent Smoothing (MaLS) and Style-aware Masked Fusion (SaMF). MaLS enhances temporal consistency by learning a smooth local transition of the video trajectory lying in the style latent space. SaMF helps the model attend to the masked region using style-modulated convolutions that inherit the temporal consistency from MaLS.
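The core of MaLS can be illustrated with a simple sketch: averaging each per-frame style latent with its temporal neighbours is a linear transformation of the latent trajectory that damps frame-to-frame jitter. This is a hypothetical minimal version; the paper learns the smoothing rather than fixing a window.

```python
import numpy as np

def smooth_latents(latents, window=3):
    """Moving-average smoothing of per-frame style latents (MaLS sketch).

    latents: (T, D) array, one StyleGAN-style latent code per frame.
    Each frame's code is replaced by the mean of a small temporal window
    around it, clipped at the sequence boundaries.
    """
    T, _ = latents.shape
    half = window // 2
    smoothed = np.empty_like(latents)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        smoothed[t] = latents[lo:hi].mean(axis=0)
    return smoothed
```

Decoding the smoothed trajectory instead of the raw per-frame codes suppresses flicker in the generated video while leaving a constant (already consistent) trajectory unchanged.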

Video 4. Ablation study (1)

Video 5. Ablation study (2)

Zero-shot vs Adaptation

We additionally propose a few-shot single-person adaptation method, which requires only a few seconds of target-person video. While preserving the lip-sync accuracy of the zero-shot model, it enhances person-specific visual details, such as the shape of the lips and teeth.
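The adaptation objective can be sketched as a reconstruction term on the few seconds of target-person video plus a sync regularizer that keeps the adapted model's lip-sync features close to those of the frozen zero-shot model. The function names, feature inputs, and weight below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def adaptation_loss(pred, target, sync_feat, sync_feat_zero_shot, lam=0.1):
    """Sketch of a few-shot adaptation objective with a sync regularizer.

    pred / target: generated vs. ground-truth frames from the target-person
    clip (the reconstruction term learns person-specific appearance).
    sync_feat / sync_feat_zero_shot: lip-sync features from the adapted
    model and the frozen zero-shot model; penalizing their distance
    regularizes fine-tuning so lip-sync generalization is preserved.
    lam: regularizer weight (an assumed hyperparameter).
    """
    recon = np.mean((pred - target) ** 2)
    sync_reg = np.mean((sync_feat - sync_feat_zero_shot) ** 2)
    return recon + lam * sync_reg
```

With lam = 0 this reduces to plain fine-tuning, which risks overfitting the lips to the few-shot clip and degrading sync on unseen audio; the regularizer trades a small amount of appearance fidelity for preserved sync accuracy.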

Video 6. Zero-shot vs Adaptation (1)

Video 7. Zero-shot vs Adaptation (2)

Visualization of Style-aware Masked Fusion

Video 8. Visualization of Style-aware Masked Fusion

BibTeX

@InProceedings{Ki_2023_ICCV,
      author    = {Ki, Taekyung and Min, Dongchan},
      title     = {StyleLipSync: Style-based Personalized Lip-sync Video Generation},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month     = {October},
      year      = {2023},
      pages     = {22841-22850}
}