<img src="https://github.com/Wan-Video/Wan2.1/raw/main/assets/logo.png" width="400"/>
💜 <a href="https://wan.video"><b>Wan</b></a>    |    🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a>    |   🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2503.20314">Technical Report</a>    |    📑 <a href="https://wan.video/welcome?spm=a2ty_o02.30011076.0.0.6c9ee41eCcluqg">Blog</a>    |   💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat Group</a>   |    📖 <a href="https://discord.gg/AKNgpMK4Yj">Discord</a>  
Wan: Open and Advanced Large-Scale Video Generative Models
In this repository, we present Wan2.1, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. Wan2.1 offers these key features: - 👍 SOTA Performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks. - 👍 Supports Consumer-grade GPUs: The T2V-1.3B model requires only 8.19 GB VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques like quantization). Its performance is even comparable to some closed-source models. - 👍 Multiple Tasks: Wan2.1 excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation. - 👍 Visual Text Generation: Wan2.1 is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications. - 👍 Powerful Video VAE: Wan-VAE delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
If your work has improved Wan2.1 and you would like more people to see it, please inform us. - Helios, a breakthrough video generation model base on Wan2.1 that achieves minute-scale, high-quality video synthesis at 19.5 FPS on a single H100 GPU (about 10 FPS on a single Ascend NPU) —without relying on conventional long video anti-drifting strategies or standard video acceleration techniques. Visit their webpage for more details. - Video-As-Prompt, the first unified semantic-controlled video generation model based on Wan2.1-14B-I2V with a Mixture-of-Transformers architecture and in-context controls (e.g., concept, style, motion, camera). Refer to the project page for more examples. - LightX2V, a lightweight and efficient video generation framework that integrates Wan2.1 and Wan2.2, supports multiple engineering acceleration techniques for fast inference, which can run on RTX 5090 and RTX 4060 (8GB VRAM). - DriVerse, an autonomous driving world model based on Wan2.1-14B-I2V, generates future driving videos conditioned on any scene frame and given trajectory. Refer to the project page for more examples. - Training-Free-WAN-Editing, built on Wan2.1-T2V-1.3B, allows training-free video editing with image-based training-free methods, such as FlowEdit and FlowAlign. - Wan-Move, accepted to NeurIPS 2025, a framework that brings Wan2.1-I2V-14B to SOTA fine-grained, point-level motion control! Refer to their project page for more information. - EchoShot, a native multi-shot portrait video generation model based on Wan2.1-T2V-1.3B, allows generation of multiple video clips featuring the same character as well as highly flexible content controllability. Refer to their project page for more information. - AniCrafter, a human-centric animation model based on Wan2.1-14B-I2V, controls the Video Diffusion Models with 3DGS Avatars to insert and animate anyone into any scene following given motion sequences. Refer to the project page for more examples. - HyperMotion, a human image animation framework based on Wan2.1, addresses the challenge of generating complex human body motions in pose-guided animation. Refer to their website for more examples. - MagicTryOn, a video virtual try-on framework built upon Wan2.1-14B-I2V, addresses the limitations of existing models in expressing garment details and maintaining dynamic stability during human motion. Refer to their website for more examples. - ATI, built on Wan2.1-I2V-14B, is a trajectory-based motion-control framework that unifies object, local, and camera movements in video generation. Refer to their website for more examples. - Phantom has developed a unified video generation framework for single and multi-subject references based on both Wan2.1-T2V-1.3B and Wan2.1-T2V-14B. Please refer to their examples. - UniAnimate-DiT, based on Wan2.1-14B-I2V, has trained a Human image animation model and has open-sourced the inference and training code. Feel free to enjoy it! - CFG-Zero enhances Wan2.1 (covering both T2V and I2V models) from the perspective of CFG. - TeaCache now supports Wan2.1 acceleration, capable of increasing speed by approximately 2x. Feel free to give it a try! - DiffSynth-Studio provides more support for Wan2.1, including video-to-video, FP8 quantization, VRAM optimization, LoRA training, and more. Please refer to their examples.
Clone the repo:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
Install dependencies:
# Ensure torch >= 2.4.0
pip install -r requirements.txt
| Models | Download Link | Notes |
|---|---|---|
| T2V-14B | 🤗 Huggingface 🤖 ModelScope | Supports both 480P and 720P |
| I2V-14B-720P | 🤗 Huggingface 🤖 ModelScope | Supports 720P |
| I2V-14B-480P | 🤗 Huggingface 🤖 ModelScope | Supports 480P |
| T2V-1.3B | 🤗 Huggingface 🤖 ModelScope | Supports 480P |
| FLF2V-14B | 🤗 Huggingface 🤖 ModelScope | Supports 720P |
| VACE-1.3B | 🤗 Huggingface 🤖 ModelScope | Supports 480P |
| VACE-14B | 🤗 Huggingface 🤖 ModelScope | Supports both 480P and 720P |
💡Note: * The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution. * For the first-last frame to video generation, we train our model primarily on Chinese text-video pairs. Therefore, we recommend using Chinese prompt to achieve better results.
Download models using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
Download models using modelscope-cli:
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
This repository supports two Text-to-Video models (1.3B and 14B) and two resolutions (480P and 720P). The parameters and configurations for these models are as follows:
| Task | Resolution | Model | |
|---|---|---|---|
| 480P | 720P | ||
| t2v-14B | ✔️ | ✔️ | Wan2.1-T2V-14B |
| t2v-1.3B | ✔️ | ❌ | Wan2.1-T2V-1.3B |
To facilitate implementation, we will start with a basic version of the inference process that skips the prompt extension step.
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
If you encounter OOM (Out-of-Memory) issues, you can use the --offload_model True and --t5_cpu options to reduce GPU memory usage. For example, on an RTX 4090 GPU:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
💡Note: If you are using the
T2V-1.3Bmodel, we recommend setting the parameter--sample_guide_scale 6. The--sample_shift parametercan be adjusted within the range of 8 to 12 based on the performance.
$ claude mcp add Wan2.1 \
-- python -m otcore.mcp_server <graph>