
github.com/SkyworkAI/SkyReels-V2 @main sqlite

250 symbols 747 edges 25 files 45 documented · 18%

README


SkyReels V2: Infinite-Length Film Generative Model

📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope

Welcome to the SkyReels V2 repository! Here, you'll find the model weights and inference code for our infinite-length film generative models. To the best of our knowledge, it is the first open-source video generative model to employ an autoregressive Diffusion Forcing architecture, and it achieves SOTA performance among publicly available models.

🔥🔥🔥 News!!

Jan 29, 2026: 🎉 We launched the API for the SkyReels-V3 models on apifree.ai.
Jan 29, 2026: 🎉 We released the inference code and model weights of SkyReels-V3.
Jun 1, 2025: 🎉 We published the technical report, SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers.
May 16, 2025: 🔥 We released the inference code for video extension and start/end frame control in the diffusion forcing model.
Apr 24, 2025: 🔥 We released the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.
Apr 21, 2025: 👋 We released the inference code and model weights of the SkyReels-V2 Series Models and the video captioning model SkyCaptioner-V1.
Apr 3, 2025: 🔥 We also released SkyReels-A2, an open-source controllable video generation framework capable of assembling arbitrary visual elements.
Feb 18, 2025: 🔥 We released SkyReels-A1, an open-source and effective framework for portrait image animation.
Feb 18, 2025: 🔥 We released SkyReels-V1. This is the first and most advanced open-source human-centric video foundation model.

🎥 Demos

The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.

📑 TODO List

[x] Technical Report
[x] Checkpoints of the 14B and 1.3B Models Series
[x] Single-GPU & Multi-GPU Inference Code
[x] SkyCaptioner-V1: A Video Captioning Model
[x] Prompt Enhancer
[x] Diffusers integration
[ ] Checkpoints of the 5B Models Series
[ ] Checkpoints of the Camera Director Models
[ ] Checkpoints of the Step & Guidance Distill Model

🚀 Quickstart

Installation

# clone the repository.
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2
# Install dependencies. Test environment uses Python 3.10.12.
pip install -r requirements.txt

Model Download

You can download our models from Hugging Face:

Type	Model Variant	Recommended Height/Width/Frame	Link
Diffusion Forcing	1.3B-540P	544 * 960 * 97f	🤗 Huggingface 🤖 ModelScope
Diffusion Forcing	5B-540P	544 * 960 * 97f	Coming Soon
Diffusion Forcing	5B-720P	720 * 1280 * 121f	Coming Soon
Diffusion Forcing	14B-540P	544 * 960 * 97f	🤗 Huggingface 🤖 ModelScope
Diffusion Forcing	14B-720P	720 * 1280 * 121f	🤗 Huggingface 🤖 ModelScope
Text-to-Video	1.3B-540P	544 * 960 * 97f	Coming Soon
Text-to-Video	5B-540P	544 * 960 * 97f	Coming Soon
Text-to-Video	5B-720P	720 * 1280 * 121f	Coming Soon
Text-to-Video	14B-540P	544 * 960 * 97f	🤗 Huggingface 🤖 ModelScope
Text-to-Video	14B-720P	720 * 1280 * 121f	🤗 Huggingface 🤖 ModelScope
Image-to-Video	1.3B-540P	544 * 960 * 97f	🤗 Huggingface 🤖 ModelScope
Image-to-Video	5B-540P	544 * 960 * 97f	Coming Soon
Image-to-Video	5B-720P	720 * 1280 * 121f	Coming Soon
Image-to-Video	14B-540P	544 * 960 * 97f	🤗 Huggingface 🤖 ModelScope
Image-to-Video	14B-720P	720 * 1280 * 121f	🤗 Huggingface 🤖 ModelScope
Camera Director	5B-540P	544 * 960 * 97f	Coming Soon
Camera Director	5B-720P	720 * 1280 * 121f	Coming Soon
Camera Director	14B-720P	720 * 1280 * 121f	Coming Soon
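
The recommended sizes in the table can be kept in a small lookup so that height, width, and frame count stay in sync when switching resolution. This is only an illustrative sketch: `Preset`, `PRESETS`, and `preset_for` are hypothetical names that do not exist in the repository; the numbers are copied from the table.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    height: int
    width: int
    num_frames: int


# Values copied from the "Recommended Height/Width/Frame" column of the table.
PRESETS = {
    "540P": Preset(height=544, width=960, num_frames=97),
    "720P": Preset(height=720, width=1280, num_frames=121),
}


def preset_for(resolution: str) -> Preset:
    """Return the recommended size for a resolution label such as '540P'."""
    try:
        return PRESETS[resolution.upper()]
    except KeyError:
        raise ValueError(
            f"unknown resolution {resolution!r}; expected one of {sorted(PRESETS)}"
        ) from None


print(preset_for("720P"))
```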

After downloading, pass the model path to your generation commands through --model_id.
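
One possible way to fetch a checkpoint ahead of time is `snapshot_download` from the `huggingface_hub` package. The sketch below is an assumption-laden example, not the repository's own download logic: `repo_id_for` and `download_model` are hypothetical helpers, and the repo-id pattern is inferred from the names `SkyReels-V2-DF-14B-720P` and `SkyReels-V2-I2V-14B-720P` mentioned in the news section, so check the model page before relying on it.

```python
def repo_id_for(task: str, size: str, resolution: str) -> str:
    """Build a repo id such as 'Skywork/SkyReels-V2-DF-14B-540P'.

    The naming pattern is inferred from the model names in this README
    (an assumption); verify the id on the hub before downloading.
    """
    return f"Skywork/SkyReels-V2-{task}-{size}-{resolution}"


def download_model(repo_id: str, local_dir: str) -> str:
    """Download a full model snapshot and return the local directory path."""
    # Imported lazily so repo_id_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


print(repo_id_for("DF", "14B", "540P"))
# To download (network access required), uncomment:
# local_path = download_model(repo_id_for("DF", "14B", "540P"), "./checkpoints/df-14b-540p")
```

The returned directory could then be used as the model path in the commands below, assuming the scripts accept a local directory in place of a hub id.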

Single GPU Inference

Diffusion Forcing for Long Video Generation

The Diffusion Forcing model can generate infinite-length videos. It supports both text-to-video (T2V) and image-to-video (I2V) tasks, and it can run inference in either synchronous or asynchronous mode. The two scripts below are examples of long video generation. If you want to adjust inference parameters such as the video duration or the inference mode, read the Note below first.

synchronous generation for 10s video

model_id=Skywork/SkyReels-V2-DF-14B-540P
# synchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 0 \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3

asynchronous generation for 30s video

model_id=Skywork/SkyReels-V2-DF-14B-540P
# asynchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 5 \
  --causal_block_size 5 \
  --base_num_frames 97 \
  --num_frames 737 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload
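
The frame counts in the two commands above (257 and 737) line up with a simple chunking picture: the first chunk contributes `base_num_frames` frames, and each later chunk reuses `overlap_history` frames and adds `base_num_frames - overlap_history` new ones. The sketch below works through that arithmetic. The chunking model is an assumption about how the script extends videos, the 24 fps figure is taken from the `export_to_video` calls in this README, and `plan_long_video` is a hypothetical helper.

```python
import math


def plan_long_video(num_frames: int, base_num_frames: int,
                    overlap_history: int, fps: int = 24):
    """Estimate (chunk_count, duration_seconds) for an overlapped long-video run."""
    if num_frames <= base_num_frames:
        return 1, num_frames / fps
    # Each chunk after the first is assumed to add this many new frames.
    new_per_chunk = base_num_frames - overlap_history
    if new_per_chunk <= 0:
        raise ValueError("overlap_history must be smaller than base_num_frames")
    extra_chunks = math.ceil((num_frames - base_num_frames) / new_per_chunk)
    return 1 + extra_chunks, num_frames / fps


chunks, seconds = plan_long_video(257, 97, 17)
print(chunks, round(seconds, 1))  # 3 10.7
chunks, seconds = plan_long_video(737, 97, 17)
print(chunks, round(seconds, 1))  # 9 30.7
```

Under these assumptions, 257 = 97 + 2 × 80 and 737 = 97 + 8 × 80, which is consistent with the "10s" and "30s" labels at 24 fps.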

Text-to-video with diffusers:

import torch
from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoModel.from_pretrained("Skywork/SkyReels-V2-DF-14B-540P-Diffusers", subfolder="vae", torch_dtype=torch.float32)

pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "Skywork/SkyReels-V2-DF-14B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16
)
flow_shift = 8.0  # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline = pipeline.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipeline(
    prompt=prompt,
    num_inference_steps=30,
    height=544,  # 720 for 720P
    width=960,   # 1280 for 720P
    num_frames=97,
    base_num_frames=97,  # 121 for 720P
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    causal_block_size=5,  # Number of frames in each block for asynchronous processing
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long video generations
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)
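
To give some intuition for `flow_shift`, the sketch below applies the shift formula commonly used by flow-matching schedulers, `shift * sigma / (1 + (shift - 1) * sigma)`. That the scheduler in this example applies exactly this expression is an assumption, and `shift_sigma` is a hypothetical helper, not part of diffusers.

```python
def shift_sigma(sigma: float, flow_shift: float) -> float:
    """Remap a noise level in [0, 1] with the flow-matching shift formula."""
    return flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma)


# Compare the I2V (5.0) and T2V (8.0) values quoted in the example above.
for shift in (5.0, 8.0):
    remapped = [round(shift_sigma(s, shift), 3) for s in (0.25, 0.5, 0.75)]
    print(shift, remapped)
```

With this formula the endpoints 0 and 1 stay fixed, while a larger shift pushes intermediate noise levels toward 1, so more of the sampling steps are spent at high noise.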

Image-to-video with diffusers:

import numpy as np
import torch
import torchvision.transforms.functional as TF
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

model_id = "Skywork/SkyReels-V2-DF-14B-720P-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
flow_shift = 5.0  # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline.to("cuda")

first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

def aspect_ratio_resize(image, pipeline, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

def center_crop_resize(image, height, width):
    # Scale the image so it fully covers the target (height, width)
    resize_ratio = max(width / image.width, height / image.height)
    resized_width = round(image.width * resize_ratio)
    resized_height = round(image.height * resize_ratio)
    image = image.resize((resized_width, resized_height))

    # Center-crop to the target size; torchvision expects [height, width]
    image = TF.center_crop(image, [height, width])

    return image, height, width

first_frame, height, width = aspect_ratio_resize(first_frame, pipeline)
if last_frame.size != first_frame.size:
    last_frame, _, _ = center_crop_resize(last_frame, height, width)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

output = pipeline(
    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=8)
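
The sizing rule in `aspect_ratio_resize` can be tried without loading a pipeline. The sketch below repeats the same arithmetic on plain integers. `fit_to_area` is a hypothetical helper, and `mod_value=16` is an assumption standing in for `pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]`; read the real value from your pipeline.

```python
import math


def fit_to_area(img_width: int, img_height: int,
                max_area: int = 720 * 1280, mod_value: int = 16):
    """Pick (height, width) near max_area that keeps the aspect ratio.

    Both dimensions are rounded down to a multiple of mod_value.
    """
    aspect_ratio = img_height / img_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


print(fit_to_area(1280, 720))   # (720, 1280)
print(fit_to_area(1000, 1000))  # (960, 960)
```

A 16:9 input maps exactly onto 720 × 1280, while a square input is given the same pixel budget as 960 × 960.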

Note:
- To run the image-to-video (I2V) task, add --image ${image_path} to your command. It is also better to use a text-to-video (T2V)-like prompt that includes some description of the first-frame image.
- For long video generation, you can just switch the --num_frames, e.g

Core symbols most depended-on inside this repo

called by 106 · skyreels_v2_infer/modules/vae.py
encode · called by 12 · skyreels_v2_infer/modules/vae.py
_sigma_to_alpha_sigma_t · called by 8 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
set_timesteps · called by 7 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
flash_attention · called by 7 · skyreels_v2_infer/modules/attention.py
step · called by 6 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
fp16_clamp · called by 5 · skyreels_v2_infer/modules/t5.py
decode · called by 5 · skyreels_v2_infer/modules/vae.py

Shape

Method 152

Class 53

Function 45

Languages

Python 100%

Modules by API surface

skyreels_v2_infer/modules/transformer.py · 45 symbols
skyreels_v2_infer/modules/vae.py · 40 symbols
skyreels_v2_infer/modules/t5.py · 37 symbols
skyreels_v2_infer/modules/clip.py · 32 symbols
skyreels_v2_infer/scheduler/fm_solvers_unipc.py · 19 symbols
skyreels_v2_infer/modules/xlm_roberta.py · 10 symbols
skyreels_v2_infer/pipelines/diffusion_forcing_pipeline.py · 9 symbols
skycaptioner_v1/scripts/vllm_struct_caption.py · 8 symbols
skyreels_v2_infer/modules/tokenizers.py · 7 symbols
skyreels_v2_infer/distributed/xdit_context_parallel.py · 7 symbols
skycaptioner_v1/scripts/vllm_fusion_caption.py · 6 symbols
skyreels_v2_infer/modules/__init__.py · 5 symbols

Dependencies from manifests, versioned

accelerate 1.6.0 · 1×
decord 0.6.0 · 1×
diffusers 0.31.0 · 1×
numpy 1.23.5 · 1×
opencv-python 4.10.0.84 · 1×
tokenizers 0.21.1 · 1×
torch 2.5.1 · 1×
torchvision 0.20.1 · 1×
transformers 4.49.0 · 1×
vllm 0.8.4 · 1×

For agents

$ claude mcp add SkyReels-V2 \
  -- python -m otcore.mcp_server <graph>
