MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling
Yue Zhang*, Zhizhou Zhong*, Minhao Liu*, Zhaokang Chen, Bin Wu†, Yubin Zeng, Chao Zhan, Junxin Huang, Yingjie He, Wenjiang Zhou (*Equal Contribution, †Corresponding Author, benbinwu@tencent.com)
Lyra Lab, Tencent Music Entertainment
GitHub | Hugging Face | Space | Technical report
We introduce MuseTalk, a real-time, high-quality lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by MuseV, as a complete virtual human solution.
We're excited to unveil MuseTalk 1.5. This version (1) integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting overall performance, and (2) adopts a two-stage training strategy and a spatio-temporal data sampling approach to balance visual quality and lip-sync accuracy. Learn more details here. The inference code, training code, and model weights of MuseTalk 1.5 are all available now! 🚀
MuseTalk is a real-time high quality audio-driven lip-syncing model trained in the latent space of ft-mse-vae, which
256 x 256.
MuseTalk was trained in latent space, where the images were encoded by a frozen VAE and the audio was encoded by a frozen whisper-tiny model. The architecture of the generation network was borrowed from the UNet of stable-diffusion-v1-4, where the audio embeddings are fused with the image embeddings by cross-attention.
Note that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is NOT a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
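To make the latent-space setup more concrete, the following sketch computes the latent tensor shape for a face crop. It assumes the usual Stable Diffusion VAE geometry (8x spatial downsampling, 4 latent channels), which this README does not state; the constant and function names are illustrative only.

```python
# Assumptions (not stated in this README): the SD VAE downsamples each
# spatial side by 8 and produces 4 latent channels.
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4


def latent_shape(height: int, width: int) -> tuple:
    """Return (channels, latent_height, latent_width) for an image crop."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("image sides must be multiples of the downsampling factor")
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


# A 256 x 256 face region maps to a much smaller latent grid.
print(latent_shape(256, 256))  # (4, 32, 32)
```

Under these assumptions, the single inpainting step runs on a 32 x 32 grid rather than on full-resolution pixels, which is consistent with the real-time claim.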
| Input Video | MuseTalk 1.0 | MuseTalk 1.5 |
| --- | --- | --- |
| https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107 | https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef | https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247 |
| https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac | https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99 | https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75 |
| https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3 | https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32 | https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c |
| https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251 | https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34 | https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc |
| https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60 | https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028 | https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4 |
| https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb | https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a | https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde |
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
Thanks to the third-party integration, which makes installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated any third-party integration. Please refer to this project for specific results.
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:
conda create -n MuseTalk python=3.10
conda activate MuseTalk
Choose one of the following installation methods:
# Option 1: Using pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
# Option 2: Using conda
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
Install the remaining required packages:
pip install -r requirements.txt
Install the MMLab ecosystem packages:
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"
Download the ffmpeg-static package
Configure FFmpeg based on your operating system:
For Linux:
export FFMPEG_PATH=/path/to/ffmpeg
# Example:
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
For Windows:
Add the ffmpeg-xxx\bin directory to your system's PATH environment variable. Verify the installation by running ffmpeg -version in the command prompt - it should display the ffmpeg version information.
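As a cross-platform complement to the manual check, the sketch below looks for an ffmpeg executable from Python. It assumes that `FFMPEG_PATH`, when set, points at a directory containing the binary (as in the Linux example above); `find_ffmpeg` is a hypothetical helper, not part of MuseTalk.

```python
import os
import shutil
from typing import Optional


def find_ffmpeg(env_var: str = "FFMPEG_PATH") -> Optional[str]:
    """Return the path of an ffmpeg executable, or None if none is found."""
    candidate_dir = os.environ.get(env_var)
    if candidate_dir:
        # Look inside the configured directory first.
        found = shutil.which("ffmpeg", path=candidate_dir)
        if found:
            return found
    # Fall back to the regular PATH lookup (the Windows setup above).
    return shutil.which("ffmpeg")


print(find_ffmpeg() or "ffmpeg not found")
```

If this prints "ffmpeg not found", revisit the configuration steps for your operating system before running inference.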
You can download weights in two ways:
We provide two scripts for automatic downloading:
For Linux:
sh ./download_weights.sh
For Windows:
# Run the script
download_weights.bat
You can also download the weights manually from the following links:
Finally, these weights should be organized in models as follows:
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── musetalkV15
│   ├── musetalk.json
│   └── unet.pth
├── syncnet
│   └── latentsync_syncnet.pt
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    ├── config.json
    ├── pytorch_model.bin
    └── preprocessor_config.json
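After downloading, it can help to confirm that every file from the layout above is in place. The following is a minimal sketch of such a check; the file list is copied from the tree, while `missing_weights` is a hypothetical helper rather than a script shipped with the repository.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
EXPECTED_WEIGHTS = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "musetalkV15/musetalk.json",
    "musetalkV15/unet.pth",
    "syncnet/latentsync_syncnet.pt",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae/config.json",
    "sd-vae/diffusion_pytorch_model.bin",
    "whisper/config.json",
    "whisper/pytorch_model.bin",
    "whisper/preprocessor_config.json",
]


def missing_weights(models_dir: str = "./models") -> list:
    """Return the expected weight files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).is_file()]


# Report anything that still needs to be downloaded.
for rel in missing_weights():
    print(f"missing: {rel}")
```

An empty report means the layout matches; any printed path points at a file to download again.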
We provide inference scripts for both versions of MuseTalk:
Before running inference, please ensure ffmpeg is installed and accessible:
# Check ffmpeg installation
ffmpeg -version
If ffmpeg is not found, please install it first:
- Windows: Download from ffmpeg-static and add to PATH
- Linux: sudo apt-get install ffmpeg
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 normal
# MuseTalk 1.0
sh inference.sh v1.0 normal
Please ensure that you set the ffmpeg_path to match the actual location of your FFmpeg installation.
# MuseTalk 1.5 (Recommended)
python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 realtime
# MuseTalk 1.0
sh inference.sh v1.0 realtime
# MuseTalk 1.5 (Recommended)
python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
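The substitutions listed in the comments above can also be applied programmatically. The sketch below assembles the argument list for either version and either mode, using only the flags that appear in the commands above. `build_inference_command` is a hypothetical helper, and the forward-slash paths are an assumption made for portability.

```python
def build_inference_command(version: str = "v15", realtime: bool = False,
                            ffmpeg_path: str = "ffmpeg", fps: int = 25) -> list:
    """Build the python -m command line for MuseTalk 1.5 (v15) or 1.0 (v1)."""
    if version not in ("v15", "v1"):
        raise ValueError("version must be 'v15' or 'v1'")
    # Model directory and weight file differ between the two versions.
    model_dir = "models/musetalkV15" if version == "v15" else "models/musetalk"
    weights = "unet.pth" if version == "v15" else "pytorch_model.bin"
    script = "scripts.realtime_inference" if realtime else "scripts.inference"
    name = "realtime" if realtime else "test"
    cmd = [
        "python", "-m", script,
        "--inference_config", f"configs/inference/{name}.yaml",
        "--result_dir", f"results/{name}",
        "--unet_model_path", f"{model_dir}/{weights}",
        "--unet_config", f"{model_dir}/musetalk.json",
        "--version", version,
    ]
    if realtime:
        # Only the real-time command above passes --fps.
        cmd += ["--fps", str(fps)]
    cmd += ["--ffmpeg_path", ffmpeg_path]
    return cmd


print(" ".join(build_inference_command("v1", realtime=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.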
The configuration file configs/inference/test.yaml contains the inference settings, including:
- video_path: Path to the input video, image file, or directory of images
- audio_path: Path to the input audio file
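Because `video_path` accepts three kinds of input, a quick pre-flight check can catch a mistyped entry before inference starts. The sketch below classifies a path as a video, a single image, or a directory of images. The extension sets and the helper name are illustrative assumptions, not the rules MuseTalk itself applies.

```python
from pathlib import Path

# Illustrative extension sets (assumptions, not taken from MuseTalk).
VIDEO_EXTS = {".mp4", ".mov", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def classify_video_path(video_path: str) -> str:
    """Return 'image_dir', 'video', or 'image' for a video_path value."""
    path = Path(video_path)
    if path.is_dir():
        return "image_dir"
    ext = path.suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"unsupported video_path: {video_path}")


print(classify_video_path("example_clip.mp4"))
```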
Note: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.
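One way to do the ffmpeg conversion is to set the output frame rate with `-r 25`, which makes ffmpeg duplicate or drop frames as needed. The sketch below only builds the command; the helper name and file names are placeholders.

```python
def to_25fps_command(src: str, dst: str, ffmpeg: str = "ffmpeg") -> list:
    """Build an ffmpeg command that re-encodes src at 25 frames per second."""
    # "-r 25" placed before the output file sets the output frame rate.
    return [ffmpeg, "-i", src, "-r", "25", dst]


cmd = to_25fps_command("input.mp4", "input_25fps.mp4")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Plain frame duplication does not smooth motion, so frame interpolation may still give better results for very low frame rates.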
Important notes for real-time inference:
1. Set preparation to True when processing a new avatar
2. After preparation, the avatar will generate videos using audio clips from audio_clips
3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100
4. Set preparation to False for generating more videos with the same avatar
For faster generation without saving images, you can use:
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the first frame to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output.
For minimum hardware requirements, we tested the system in a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes.
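To put the minimum-hardware figure in perspective, the short calculation below converts it to frames per second. It assumes the output video runs at 25 fps (the recommended rate mentioned earlier); the function name is illustrative.

```python
def frames_per_second(video_seconds: float, wall_seconds: float,
                      video_fps: float = 25.0) -> float:
    """Average generation throughput in frames per second."""
    return video_seconds * video_fps / wall_seconds


# 8-second clip generated in about 5 minutes, assuming 25 fps output.
rate = frames_per_second(8, 5 * 60)
print(f"{rate:.2f} frames/s")                      # 0.67 frames/s
print(f"{30 / rate:.0f}x slower than 30 frames/s")  # 45x slower than 30 frames/s
```

Under that assumption, the 4GB laptop GPU is suitable for offline generation but is far from the 30fps+ reported on a Tesla V100.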
Both Linux and Windows users can launch the demo using the following command. Please ensure that the ffmpeg_path parameter matches your actual FFmpeg installation path:
# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time
python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
To train MuseTalk, you need to prepare your dataset following these steps:
For example, if you're using the HDTF dataset, place all your video files in ./dataset/HDTF/source.
python -m scripts.preprocess --config ./configs/training/preprocess.yaml
This script will:

After data preprocessing, you can start the training process:
First Stage
sh train.sh stage1
Second Stage
sh train.sh stage2
Before starting the training, you should adjust the configuration files according to your hardware a