
github.com/MeiGen-AI/InfiniteTalk @main sqlite


562 symbols · 1,729 edges · 50 files · 104 documented (19%)

README

InfiniteTalk

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

Shaoshu Yang* · Zhe Kong* · Feng Gao* · Meng Cheng* · Xiangyu Liu* · Yong Zhang^✉ · Zhuoliang Kang

Wenhan Luo · Xunliang Cai · Ran He · Xiaoming Wei

^*Equal Contribution ^✉Corresponding Authors

TL;DR: InfiniteTalk is an unlimited-length talking video generation model that supports both audio-driven video-to-video and image-to-video generation.

🔥 Latest News

May 21, 2026: 🚀 We release LongCat-Video-Avatar-1.5, an upgraded open-source framework for audio-driven human video generation. v1.5 replaces Wav2Vec2 with Whisper-Large for more accurate lip synchronization, achieves production-ready physical rationality and temporal stability with robust long-video generation, generalizes to stylized domains (anime, animals, complex real-world conditions), supports both single-stream and multi-stream audio inputs, and accelerates inference to 8 steps via step distillation. [ code | 🤗 weights | project page ]
Dec 16, 2025: 🚀 We are excited to announce the release of LongCat-Video-Avatar, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation with seamless compatibility for both single-stream and multi-stream audio inputs. The release includes our Technical Report, code, model weights, and project page.
August 19, 2025: We release the technical report, weights, and code of InfiniteTalk. The Gradio and ComfyUI branches have been released.
August 19, 2025: We release the project page of InfiniteTalk.

✨ Key Features

We propose InfiniteTalk, a novel sparse-frame video dubbing framework. Given an input video and audio track, InfiniteTalk synthesizes a new video with accurate lip synchronization while simultaneously aligning head movements, body posture, and facial expressions with the audio. Unlike traditional dubbing methods that focus solely on lips, InfiniteTalk enables infinite-length video generation with accurate lip synchronization and consistent identity preservation. Besides, InfiniteTalk can also be used as an image-audio-to-video model, taking an image and an audio clip as input.

- 💬 Sparse-frame Video Dubbing – Synchronizes not only lips, but also head, body, and expressions
- ⏱️ Infinite-Length Generation – Supports unlimited video duration
- ✨ Stability – Reduces hand/body distortions compared to MultiTalk
- 🚀 Lip Accuracy – Achieves superior lip synchronization to MultiTalk

🌐 Community Works

Wan2GP: Thanks to deepbeepmeep for integrating InfiniteTalk into Wan2GP, which is optimized for low VRAM and offers many video editing options and other models (MMaudio support, Qwen Image Edit, ...).
ComfyUI: Thanks to kijai for the ComfyUI support.

📑 Todo List

[x] Release the technical report
[x] Inference
[x] Checkpoints
[x] Multi-GPU Inference
[ ] Inference acceleration
  [x] TeaCache
  [x] int8 quantization
  [ ] LCM distillation
  [ ] Sparse Attention
[x] Run with very low VRAM
[x] Gradio demo
[x] ComfyUI

Video Demos

Video-to-video (HQ videos can be found on Google Drive)

Image-to-video

Quick Start

🛠️Installation

1. Create a conda environment and install PyTorch and xformers

conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121

2. Flash-attn installation:

pip install misaki[en]
pip install ninja 
pip install psutil 
pip install packaging
pip install wheel
pip install flash_attn==2.7.4.post1

3. Other dependencies

pip install -r requirements.txt
conda install -c conda-forge librosa

4. FFmpeg installation (use either of the two commands below)

conda install -c conda-forge ffmpeg

sudo yum install ffmpeg ffmpeg-devel
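The version pins in the installation steps can be checked after setup. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and it assumes each package is registered under the distribution name used in the `pip install` commands (for example `flash_attn`).

```python
from importlib import metadata

# Pins copied from the installation steps above.
EXPECTED = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.4.1",
    "xformers": "0.0.28",
    "flash_attn": "2.7.4.post1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # CUDA wheels can report versions such as "2.4.1+cu121", so the local
        # build suffix after "+" is ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            problems[package] = (wanted, found)
    return problems
```

Calling `check_pins()` in the activated conda environment returns an empty dict when every pin matches; any other entry names a package to reinstall.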

🧱Model Preparation

1. Model Download

Models	Download Link	Notes
Wan2.1-I2V-14B-480P	🤗 Huggingface	Base model
chinese-wav2vec2-base	🤗 Huggingface	Audio encoder
MeiGen-InfiniteTalk	🤗 Huggingface	Our audio condition weights

Download models using huggingface-cli:

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download MeiGen-AI/InfiniteTalk --local-dir ./weights/InfiniteTalk
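The same downloads can likely be scripted from Python with the `huggingface_hub` package. This is a sketch under that assumption rather than a script shipped with the repository; `SNAPSHOTS`, `EXTRA_FILE`, and `download_all` are hypothetical names, while the repository IDs, revision, and target folders are copied from the commands above.

```python
# Repositories and target folders copied from the huggingface-cli commands above.
SNAPSHOTS = [
    ("Wan-AI/Wan2.1-I2V-14B-480P", "./weights/Wan2.1-I2V-14B-480P"),
    ("TencentGameMate/chinese-wav2vec2-base", "./weights/chinese-wav2vec2-base"),
    ("MeiGen-AI/InfiniteTalk", "./weights/InfiniteTalk"),
]

# The third CLI command fetches a single file from a pull-request revision.
EXTRA_FILE = {
    "repo_id": "TencentGameMate/chinese-wav2vec2-base",
    "filename": "model.safetensors",
    "revision": "refs/pr/1",
    "local_dir": "./weights/chinese-wav2vec2-base",
}


def download_all():
    """Download every checkpoint; needs network access and huggingface_hub."""
    # Imported lazily so the plan above can be inspected without the package.
    from huggingface_hub import hf_hub_download, snapshot_download

    for repo_id, local_dir in SNAPSHOTS:
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
    hf_hub_download(**EXTRA_FILE)
```

Running `download_all()` from the repository root should leave the same `./weights` layout that the inference commands expect.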

🔑 Quick Inference

Our model is compatible with both 480P and 720P resolutions.

Some tips

- Lip synchronization accuracy: Audio CFG works best between 3 and 5. Increase the audio CFG value for better synchronization.
- FusionX: While it enables faster inference and higher quality, FusionX LoRA exacerbates color shift in videos longer than 1 minute and reduces ID preservation.
- V2V generation: Enables unlimited-length generation. The model mimics the original video's camera movement, though not identically. Using SDEdit improves camera movement accuracy significantly but introduces color shift and is best suited for short clips. Improvements for long video camera control are planned.
- I2V generation: Generates good results from a single image for up to 1 minute. Beyond 1 minute, color shifts become more pronounced. One trick for high-quality generation beyond 1 minute is to turn the image into a video by translating or zooming in on it. Here is a script to convert an image to a video.
- Quantization model: If your inference process is killed due to insufficient memory, we suggest using the quantization model, which can help reduce memory usage.
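The zoom trick from the I2V tip can be illustrated as follows. This is not the repository's conversion script, only a sketch of the idea: `zoom_crop_boxes` and `write_zoom_video` are hypothetical helpers, and the defaults (`max_zoom=1.2`, 125 frames, 25 fps) are illustrative assumptions. Video writing relies on `opencv-python`, which appears in the dependency list.

```python
def zoom_crop_boxes(width, height, num_frames, max_zoom=1.2):
    """Return one centered (left, top, right, bottom) crop box per frame.

    The zoom factor grows linearly from 1.0 to max_zoom, so the crop
    shrinks slowly and the resized frames look like a gentle zoom-in.
    """
    if num_frames < 1 or max_zoom < 1.0:
        raise ValueError("need num_frames >= 1 and max_zoom >= 1.0")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        zoom = 1.0 + (max_zoom - 1.0) * t
        crop_w = round(width / zoom)
        crop_h = round(height / zoom)
        left = (width - crop_w) // 2
        top = (height - crop_h) // 2
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes


def write_zoom_video(image_path, out_path, num_frames=125, fps=25):
    """Write the zoomed frames to a video file; requires opencv-python."""
    import cv2  # imported lazily so zoom_crop_boxes works without OpenCV

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
    for left, top, right, bottom in zoom_crop_boxes(width, height, num_frames):
        # Crop the shrinking window and scale it back to the full frame size.
        frame = cv2.resize(image[top:bottom, left:right], (width, height))
        writer.write(frame)
    writer.release()
```

The resulting clip could then be supplied as the conditioning video so that generation runs in video-to-video mode instead of from a single still image.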

Usage of InfiniteTalk

--mode streaming: long video generation.
--mode clip: generate short video with one chunk. 
--use_teacache: run with TeaCache.
--size infinitetalk-480: generate 480P video.
--size infinitetalk-720: generate 720P video.
--use_apg: run with APG.
--teacache_thresh: A coefficient used for TeaCache acceleration
--sample_text_guide_scale: When not using LoRA, the optimal value is 5. After applying LoRA, the recommended value is 1.
--sample_audio_guide_scale: When not using LoRA, the optimal value is 4. After applying LoRA, the recommended value is 2.
--max_frame_num: The maximum number of frames in the generated video; the default is 1000 frames (40 seconds).
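The guide-scale recommendations and the frame budget can be captured in two small helpers. This is a sketch with hypothetical function names; the 25 fps figure is not stated directly but is inferred from the note that 1000 frames correspond to 40 seconds, and any additional constraints the script places on frame counts are not modeled.

```python
def recommended_guide_scales(use_lora):
    """Return the text/audio guide scales suggested in the flag notes above."""
    if use_lora:
        return {"sample_text_guide_scale": 1.0, "sample_audio_guide_scale": 2.0}
    return {"sample_text_guide_scale": 5.0, "sample_audio_guide_scale": 4.0}


# 1000 frames are described as 40 seconds, which implies 25 frames per second.
IMPLIED_FPS = 1000 / 40


def max_frame_num_for(seconds, fps=IMPLIED_FPS):
    """Frame budget to pass as --max_frame_num for a clip of `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return int(round(seconds * fps))
```

For example, a two-minute video would need `max_frame_num_for(120)` frames under this assumption, which is well above the default.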

1. Inference

1) Run with single GPU

python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res

2) Run with 720P

To run at 720P, set --size infinitetalk-720:

python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-720 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_720p

3) Run with very low VRAM

To run with very low VRAM, set --num_persistent_param_in_dit 0:

python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_lowvram
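The three single-GPU commands above differ only in `--size`, `--save_file`, and the optional `--num_persistent_param_in_dit 0`. A small builder makes that explicit. `build_command` is a hypothetical helper, not part of the repository, and it only assembles flags that appear in the commands above.

```python
import shlex


def build_command(size="infinitetalk-480", low_vram=False,
                  save_file="infinitetalk_res",
                  input_json="examples/single_example_image.json"):
    """Return the argv list for one single-GPU run of generate_infinitetalk.py."""
    if size not in ("infinitetalk-480", "infinitetalk-720"):
        raise ValueError(f"unsupported size: {size}")
    argv = [
        "python", "generate_infinitetalk.py",
        "--ckpt_dir", "weights/Wan2.1-I2V-14B-480P",
        "--wav2vec_dir", "weights/chinese-wav2vec2-base",
        "--infinitetalk_dir", "weights/InfiniteTalk/single/infinitetalk.safetensors",
        "--input_json", input_json,
        "--size", size,
        "--sample_steps", "40",
    ]
    if low_vram:
        # Flag and value taken from the low-VRAM example above.
        argv += ["--num_persistent_param_in_dit", "0"]
    argv += ["--mode", "streaming", "--motion_frame", "9", "--save_file", save_file]
    return argv


# Print the 720P variant as a shell-ready string.
print(shlex.join(build_command(size="infinitetalk-720",
                               save_file="infinitetalk_res_720p")))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.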

4) Multi-GPU inference

GPU_NUM=8
torchrun --nproc_per_node=$GPU_NUM --standalone generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --dit_fsdp --t5_fsdp \
    --ulysses_size=$GPU_NUM \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multigpu
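In the multi-GPU command, `--nproc_per_node` and `--ulysses_size` are both set from `GPU_NUM`. The sketch below keeps the two in step when the GPU count changes; `build_multigpu_command` is a hypothetical helper, and whether other combinations of the two values are supported is not covered here.

```python
def build_multigpu_command(gpu_num=8, save_file="infinitetalk_res_multigpu"):
    """Return the torchrun argv for a single-node run on gpu_num GPUs."""
    if gpu_num < 1:
        raise ValueError("gpu_num must be at least 1")
    return [
        "torchrun", f"--nproc_per_node={gpu_num}", "--standalone",
        "generate_infinitetalk.py",
        "--ckpt_dir", "weights/Wan2.1-I2V-14B-480P",
        "--wav2vec_dir", "weights/chinese-wav2vec2-base",
        "--infinitetalk_dir", "weights/InfiniteTalk/single/infinitetalk.safetensors",
        "--dit_fsdp", "--t5_fsdp",
        # The example sets the Ulysses size to the same value as the process count.
        f"--ulysses_size={gpu_num}",
        "--input_json", "examples/single_example_image.json",
        "--size", "infinitetalk-480",
        "--sample_steps", "40",
        "--mode", "streaming",
        "--motion_frame", "9",
        "--save_file", save_file,
    ]
```

For instance, `build_multigpu_command(4)` yields the same command as above with both values set to 4.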

5) Multi-Person animation

python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
    --input_json examples/multi_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multiperson

2. R

Core symbols (most depended-on inside this repo)

wan/utils/multitalk_utils.py

wan/modules/attention.py

_sigma_to_alpha_sigma_t

called by 10

wan/utils/fm_solvers.py

wan/utils/fm_solvers.py

Shape

Method 336

Function 124

Class 102

Languages

Python 100%

Modules by API surface

wan/modules/vae.py · 39 symbols

wan/modules/t5.py · 37 symbols

wan/modules/multitalk_model.py · 37 symbols

kokoro/istftnet.py · 36 symbols

wan/modules/model.py · 34 symbols

wan/modules/clip.py · 32 symbols

wan/utils/vace_processor.py · 22 symbols

wan/utils/fm_solvers.py · 22 symbols

kokoro/modules.py · 21 symbols

wan/utils/multitalk_utils.py · 20 symbols

wan/utils/fm_solvers_unipc.py · 19 symbols

src/vram_management/layers.py · 19 symbols

Dependencies (from manifests, versioned)

accelerate 1.1.1 · 1×

diffusers 0.31.0 · 1×

gradio 5.0.0 · 1×

moviepy 1.0.3 · 1×

numpy 1.23.5 · 1×

opencv-python 4.9.0.80 · 1×

optimum-quanto 0.2.6 · 1×

tokenizers 0.20.3 · 1×

transformers 4.49.0 · 1×

xfuser 0.4.1 · 1×

For agents

$ claude mcp add InfiniteTalk \
  -- python -m otcore.mcp_server <graph>
