github.com/TMElyralab/MuseTalk @main

478 symbols · 1,393 edges · 54 files · 110 documented (23%)

README

MuseTalk

MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling

Yue Zhang^*, Zhizhou Zhong^*, Minhao Liu^*, Zhaokang Chen, Bin Wu^†, Yubin Zeng, Chao Zhan, Junxin Huang, Yingjie He, Wenjiang Zhou (^*Equal Contribution, ^†Corresponding Author, benbinwu@tencent.com)

Lyra Lab, Tencent Music Entertainment

We introduce MuseTalk, a real-time, high-quality lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by MuseV, as a complete virtual human solution.

🔥 Updates

We're excited to unveil MuseTalk 1.5. This version (1) integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting overall performance, and (2) adopts a two-stage training strategy and a spatio-temporal data sampling approach to balance visual quality and lip-sync accuracy. Learn more details here. The inference code, training code, and model weights of MuseTalk 1.5 are all available now! 🚀

Overview

MuseTalk is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of ft-mse-vae. It:

modifies an unseen face according to the input audio, with a face region of 256 x 256.
supports audio in various languages, such as Chinese, English, and Japanese.
supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
supports adjusting the center point of the proposed face region, which SIGNIFICANTLY affects generation results.
provides a checkpoint trained on HDTF and a private dataset.

News

[04/05/2025] We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.
[03/28/2025] We are thrilled to announce the release of version 1.5. This version is a significant improvement over version 1.0, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We have updated the technical report with more details.
[10/18/2024] We release the technical report. Our report details a superior model to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
[04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
[04/16/2024] Release Gradio demo on HuggingFace Spaces (thanks to HF team for their community grant)
[04/02/2024] Release MuseTalk project and pretrained models.

Model

Model Structure

MuseTalk was trained in latent space, where images were encoded by a frozen VAE and audio was encoded by a frozen whisper-tiny model. The architecture of the generation network was borrowed from the UNet of stable-diffusion-v1-4, with the audio embeddings fused into the image embeddings by cross-attention.

Note that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is NOT a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
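To make the fusion step more concrete, here is a minimal pure-Python sketch of scaled dot-product cross-attention, with image latent tokens acting as queries and audio embedding tokens acting as keys and values. It is an illustration only: the function name, the toy vectors, and the absence of learned projection matrices are assumptions, and it does not reproduce MuseTalk's UNet.

```python
import math

def cross_attention(image_tokens, audio_tokens):
    """Toy scaled dot-product cross-attention (no learned projections).

    image_tokens: list of query vectors, one per latent position.
    audio_tokens: list of key/value vectors, one per audio feature frame.
    Returns one fused vector per image token.
    """
    dim = len(audio_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in image_tokens:
        # Similarity of this image token to every audio token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in audio_tokens]
        # Numerically stable softmax over the audio tokens.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the audio tokens (used as values here).
        fused.append([
            sum(w * value[i] for w, value in zip(weights, audio_tokens))
            for i in range(dim)
        ])
    return fused

# Hypothetical toy inputs: 3 latent positions, 2 audio frames, dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(image_tokens, audio_tokens)
```

Each output row is a convex combination of the audio tokens, so every image position receives an audio-dependent signal in one pass, which is consistent with the single-step (non-iterative) generation described above.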

Cases

Input Video

https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107
https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac
https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3
https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251
https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60
https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb

MuseTalk 1.0

https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef
https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99
https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32
https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34
https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028
https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a

MuseTalk 1.5

https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247
https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75
https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c
https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc
https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4
https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde

TODO:

[x] trained models and inference codes.
[x] Huggingface Gradio demo.
[x] codes for real-time inference.
[x] technical report.
[x] a better model with updated technical report.
[x] realtime inference code for 1.5 version.
[x] training and data preprocessing codes.
[ ] always welcome to submit issues and PRs to improve this repository! 😊

Getting Started

We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:

Third party integration

Thanks to the third-party integrations, which make installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party projects; refer to each project for specific results.

ComfyUI

Installation

To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

Build environment

We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:

conda create -n MuseTalk python==3.10
conda activate MuseTalk

Install PyTorch 2.0.1

Choose one of the following installation methods:

# Option 1: Using pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

# Option 2: Using conda
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia

Install Dependencies

Install the remaining required packages:

pip install -r requirements.txt

Install MMLab Packages

Install the MMLab ecosystem packages:

pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"

Setup FFmpeg

Download the ffmpeg-static package
Configure FFmpeg based on your operating system:

For Linux:

export FFMPEG_PATH=/path/to/ffmpeg
# Example:
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static

For Windows: Add the ffmpeg-xxx\bin directory to your system's PATH environment variable. Verify the installation by running ffmpeg -version in the command prompt - it should display the ffmpeg version information.

Download weights

You can download weights in two ways:

Option 1: Using Download Scripts

We provide two scripts for automatic downloading:

For Linux:

sh ./download_weights.sh

For Windows:

# Run the script
download_weights.bat

Option 2: Manual Download

You can also download the weights manually from the following links:

Download our trained weights
Download the weights of other components:
sd-vae-ft-mse
whisper
dwpose
syncnet
face-parse-bisent
resnet18

Finally, these weights should be organized in models as follows:

./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── musetalkV15
│   ├── musetalk.json
│   └── unet.pth
├── syncnet
│   └── latentsync_syncnet.pt
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    ├── config.json
    ├── pytorch_model.bin
    └── preprocessor_config.json
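Before running inference, it can help to confirm that the layout above is complete. The following is a small sketch, not part of the repository: the file list is copied from the tree above, while the helper name missing_weights and the default ./models location are assumptions for illustration.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
EXPECTED_WEIGHTS = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "musetalkV15/musetalk.json",
    "musetalkV15/unet.pth",
    "syncnet/latentsync_syncnet.pt",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae/config.json",
    "sd-vae/diffusion_pytorch_model.bin",
    "whisper/config.json",
    "whisper/pytorch_model.bin",
    "whisper/preprocessor_config.json",
]

def missing_weights(models_dir="./models"):
    """Return the expected weight files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files are present.")
```

The check only looks at file presence; it does not validate file contents or checksums.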

Quickstart

Inference

We provide inference scripts for both versions of MuseTalk:

Prerequisites

Before running inference, please ensure ffmpeg is installed and accessible:

# Check ffmpeg installation
ffmpeg -version

If ffmpeg is not found, please install it first:

- Windows: Download from ffmpeg-static and add to PATH
- Linux: sudo apt-get install ffmpeg
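The same check can be scripted. The sketch below looks for an ffmpeg executable first in the directory named by FFMPEG_PATH (treated as a directory, as in the Linux example earlier) and then on the system PATH. The helper name locate_ffmpeg is hypothetical, and how MuseTalk itself resolves ffmpeg is not shown here.

```python
import os
import shutil

def locate_ffmpeg(env=None):
    """Return a path to an ffmpeg executable, or None if none is found.

    Checks the directory named by FFMPEG_PATH first, then the PATH entries.
    """
    env = os.environ if env is None else env
    ffmpeg_dir = env.get("FFMPEG_PATH")
    if ffmpeg_dir:
        found = shutil.which("ffmpeg", path=ffmpeg_dir)
        if found:
            return found
    # An empty PATH string makes shutil.which return None.
    return shutil.which("ffmpeg", path=env.get("PATH", ""))

if __name__ == "__main__":
    print(locate_ffmpeg() or "ffmpeg not found")
```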

Normal Inference

Linux Environment

# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 normal

# MuseTalk 1.0
sh inference.sh v1.0 normal

Windows Environment

Please ensure that you set the ffmpeg_path to match the actual location of your FFmpeg installation.

# MuseTalk 1.5 (Recommended)
python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin

# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
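The substitutions listed above can be captured in a small helper that assembles the argument list for either version. This is a sketch under stated assumptions: the function name build_inference_command is hypothetical, and the paths and flags are copied from the Windows command above rather than taken from the scripts themselves.

```python
def build_inference_command(version="v15",
                            ffmpeg_path=r"ffmpeg-master-latest-win64-gpl-shared\bin"):
    """Build the Windows inference command for MuseTalk 1.5 ("v15") or 1.0 ("v1")."""
    if version == "v15":
        model_dir, weights = r"models\musetalkV15", "unet.pth"
    elif version == "v1":
        model_dir, weights = r"models\musetalk", "pytorch_model.bin"
    else:
        raise ValueError(f"unsupported version: {version!r}")
    return [
        "python", "-m", "scripts.inference",
        "--inference_config", r"configs\inference\test.yaml",
        "--result_dir", r"results\test",
        "--unet_model_path", model_dir + "\\" + weights,
        "--unet_config", model_dir + "\\" + "musetalk.json",
        "--version", version,
        "--ffmpeg_path", ffmpeg_path,
    ]

command_v1 = build_inference_command("v1")
print(" ".join(command_v1))
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root.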

Real-time Inference

Linux Environment

# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 realtime

# MuseTalk 1.0
sh inference.sh v1.0 realtime

Windows Environment

# MuseTalk 1.5 (Recommended)
python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin

# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1

The configuration file configs/inference/test.yaml contains the inference settings, including:

- video_path: Path to the input video, image file, or directory of images
- audio_path: Path to the input audio file
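As a rough illustration of such a file, the sketch below renders a minimal configuration with the two keys named above. The task_<n> grouping key, the helper name render_inference_config, and the example paths are assumptions for illustration; check the shipped configs/inference/test.yaml for the actual layout.

```python
def render_inference_config(tasks):
    """Render tasks as YAML text, one top-level entry per task.

    tasks: list of (video_path, audio_path) pairs.
    The task_<n> grouping key is an assumption for illustration.
    """
    lines = []
    for index, (video_path, audio_path) in enumerate(tasks):
        lines.append(f"task_{index}:")
        lines.append(f'  video_path: "{video_path}"')
        lines.append(f'  audio_path: "{audio_path}"')
    return "\n".join(lines) + "\n"

config_text = render_inference_config([
    ("data/video/input.mp4", "data/audio/input.wav"),  # hypothetical paths
])
print(config_text)
```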

Note: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.
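One way to do the ffmpeg conversion is with the -r output option, which sets a constant output frame rate by duplicating or dropping frames (it does not interpolate). The sketch below only builds the command; the helper name and the file names are hypothetical.

```python
def ffmpeg_to_25fps_command(src, dst, fps=25):
    """Build an ffmpeg command that re-encodes src at a constant frame rate."""
    return ["ffmpeg", "-i", src, "-r", str(fps), dst]

command = ffmpeg_to_25fps_command("input.mp4", "input_25fps.mp4")
print(" ".join(command))
```

Run the resulting command with subprocess.run(command, check=True) once ffmpeg is on your PATH.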

Important notes for real-time inference:

1. Set preparation to True when processing a new avatar
2. After preparation, the avatar will generate videos using audio clips from audio_clips
3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100
4. Set preparation to False for generating more videos with the same avatar
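To put the 30fps+ figure in context, real-time generation means producing frames at least as fast as the playback rate. The short sketch below computes the per-frame time budget and checks a measured run against a target rate. The helper names and the sample timing are illustrative assumptions, not measurements from MuseTalk.

```python
def frame_budget_ms(fps):
    """Maximum time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def is_real_time(seconds_elapsed, frames_generated, target_fps=25):
    """True if the measured generation rate meets or exceeds target_fps."""
    return frames_generated / seconds_elapsed >= target_fps

budget_25 = frame_budget_ms(25)   # time per frame at the 25fps training rate
budget_30 = frame_budget_ms(30)   # time per frame at 30fps

# Hypothetical measurement: 300 frames generated in 10 seconds (30 frames/s).
fast_enough = is_real_time(10.0, 300, target_fps=25)
```

At 25fps the budget is 40 ms per frame; sustaining 30fps leaves roughly 33 ms per frame.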

For faster generation without saving images, you can use:

python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images

Gradio Demo

We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To save inference time, users can generate only the first frame while tuning the lip-sync parameters, which helps reduce facial artifacts in the final output.

For minimum hardware requirements, we tested the system in a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes.

Both Linux and Windows users can launch the demo using the following command. Please ensure that the ffmpeg_path parameter matches your actual FFmpeg installation path:

# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time
python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin

Training

Data Preparation

To train MuseTalk, you need to prepare your dataset following these steps:

Place your source videos

For example, if you're using the HDTF dataset, place all your video files in ./dataset/HDTF/source.

Run the preprocessing script

python -m scripts.preprocess --config ./configs/training/preprocess.yaml

This script will:
Extract frames from videos
Detect and align faces
Generate audio features
Create the necessary data structure for training

Training Process

After data preprocessing, you can start the training process:

First Stage

sh train.sh stage1

Second Stage

sh train.sh stage2

Configuration Adjustment

Before starting the training, you should adjust the configuration files according to your hardware a

Core symbols most depended-on inside this repo

encode · called by 15 · musetalk/whisper/whisper/tokenizer.py
run · called by 11 · musetalk/whisper/whisper/decoding.py
device · called by 9 · musetalk/whisper/whisper/model.py
read_imgs · called by 8 · musetalk/utils/preprocessing.py
decode · called by 7 · musetalk/whisper/whisper/tokenizer.py
softmax · called by 6 · musetalk/utils/face_detection/api.py
get_landmark_and_bbox · called by 5 · musetalk/utils/preprocessing.py
_get_single_token_id · called by 5 · musetalk/whisper/whisper/tokenizer.py

Shape

Method 254

Function 140

Class 84

Languages

Python 100%

Modules by API surface

musetalk/whisper/whisper/decoding.py · 52 symbols
musetalk/utils/face_parsing/model.py · 33 symbols
musetalk/whisper/whisper/model.py · 31 symbols
musetalk/data/audio.py · 21 symbols
musetalk/whisper/whisper/tokenizer.py · 19 symbols
musetalk/utils/face_detection/models.py · 19 symbols
musetalk/loss/vgg_face.py · 18 symbols
musetalk/data/dataset.py · 18 symbols
musetalk/whisper/whisper/normalizers/english.py · 16 symbols
musetalk/utils/face_detection/api.py · 16 symbols
musetalk/utils/utils.py · 14 symbols
musetalk/models/syncnet.py · 14 symbols

Dependencies from manifests, versioned

accelerate 0.28.0 · 1×
diffusers 0.30.2 · 1×
einops 0.8.1 · 1×
gradio 5.24.0 · 1×
huggingface_hub 0.30.2 · 1×
librosa 0.11.0 · 1×
numpy 1.23.5 · 1×
opencv-python 4.9.0.80 · 1×
soundfile 0.12.1 · 1×
tensorboard 2.12.0 · 1×
tensorflow 2.12.0 · 1×
transformers 4.39.2 · 1×

For agents

$ claude mcp add MuseTalk \
  -- python -m otcore.mcp_server <graph>