github.com/SkyworkAI/SkyReels-V2 @main sqlite

250 symbols · 747 edges · 25 files · 45 documented (18%)
README

SkyReels V2: Infinite-Length Film Generative Model

📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope


Welcome to the SkyReels V2 repository! Here, you'll find the model weights and inference code for our infinite-length film generative models. To the best of our knowledge, it is the first open-source video generative model to employ an autoregressive Diffusion Forcing architecture, and it achieves state-of-the-art (SOTA) performance among publicly available models.

🎥 Demos

The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.

📑 TODO List

  • [x] Technical Report
  • [x] Checkpoints of the 14B and 1.3B Models Series
  • [x] Single-GPU & Multi-GPU Inference Code
  • [x] SkyCaptioner-V1: A Video Captioning Model
  • [x] Prompt Enhancer
  • [x] Diffusers integration
  • [ ] Checkpoints of the 5B Models Series
  • [ ] Checkpoints of the Camera Director Models
  • [ ] Checkpoints of the Step & Guidance Distill Model

🚀 Quickstart

Installation

# Clone the repository.
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2
# Install dependencies. Test environment uses Python 3.10.12.
pip install -r requirements.txt

Model Download

You can download our models from Hugging Face:

| Type | Model Variant | Recommended Height/Width/Frame | Link |
|---|---|---|---|
| Diffusion Forcing | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Diffusion Forcing | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Diffusion Forcing | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Diffusion Forcing | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Diffusion Forcing | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Text-to-Video | 1.3B-540P | 544 * 960 * 97f | Coming Soon |
| Text-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Text-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Text-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Text-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Image-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Image-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Camera Director | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Camera Director | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Camera Director | 14B-720P | 720 * 1280 * 121f | Coming Soon |
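
The recommended sizes in the table can be kept in a small lookup so that scripts pick consistent values. The sketch below is illustrative only: `RECOMMENDED_SHAPES` and `recommended_shape` are hypothetical names that do not exist in the repository, and the numbers are copied from the table.

```python
# Hypothetical helper: recommended (height, width, frames) per resolution,
# copied from the model table. Not part of the SkyReels-V2 codebase.
RECOMMENDED_SHAPES = {
    "540P": (544, 960, 97),
    "720P": (720, 1280, 121),
}


def recommended_shape(variant: str) -> dict:
    """Return the recommended dimensions for a variant such as '14B-540P'."""
    # The resolution is the part after the last hyphen, e.g. '540P'.
    resolution = variant.rsplit("-", 1)[-1].upper()
    if resolution not in RECOMMENDED_SHAPES:
        raise ValueError(f"Unknown resolution in variant: {variant!r}")
    height, width, frames = RECOMMENDED_SHAPES[resolution]
    return {"height": height, "width": width, "num_frames": frames}


print(recommended_shape("14B-540P"))
# {'height': 544, 'width': 960, 'num_frames': 97}
```

The returned keys match the `height`, `width`, and `num_frames` arguments used in the diffusers examples, so the dictionary could be unpacked into a pipeline call.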

After downloading, set the model path in your generation commands.

Single GPU Inference

  • Diffusion Forcing for Long Video Generation

The Diffusion Forcing model can generate videos of unbounded length. It supports both text-to-video (T2V) and image-to-video (I2V) tasks and can run inference in either synchronous or asynchronous mode. The two scripts below are examples of long video generation. To adjust inference parameters such as the video duration or the inference mode, read the Note below first.

Synchronous generation for a 10s video

model_id=Skywork/SkyReels-V2-DF-14B-540P
# synchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 0 \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3

Asynchronous generation for a 30s video

model_id=Skywork/SkyReels-V2-DF-14B-540P
# asynchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 5 \
  --causal_block_size 5 \
  --base_num_frames 97 \
  --num_frames 737 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload
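
The two commands differ mainly in `--num_frames` (257 versus 737). One way to read those numbers is sketched below. It assumes, and the README does not state this, that every segment after the first reuses `overlap_history` frames and contributes the remaining `base_num_frames - overlap_history` as new frames. The function name `plan_long_video` is hypothetical, and the 24 fps default is taken from the `export_to_video` calls in the diffusers examples.

```python
import math


def plan_long_video(num_frames, base_num_frames=97, overlap_history=17, fps=24):
    """Estimate segment count and duration for a long-video request.

    Assumption (not confirmed by the README): each segment after the first
    reuses `overlap_history` frames and adds the remainder as new frames.
    """
    if overlap_history >= base_num_frames:
        raise ValueError("overlap_history must be smaller than base_num_frames")
    new_per_segment = base_num_frames - overlap_history
    # Frames still missing after the first full-length segment.
    extra = max(0, num_frames - base_num_frames)
    segments = 1 + math.ceil(extra / new_per_segment)
    return {"segments": segments, "duration_s": round(num_frames / fps, 2)}


for frames in (257, 737):
    print(frames, plan_long_video(frames))
# 257 {'segments': 3, 'duration_s': 10.71}
# 737 {'segments': 9, 'duration_s': 30.71}
```

Under this assumption both example values have the form 97 + 80k (k = 2 and k = 8), which lines up with the roughly 10-second and 30-second durations named in the headings.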

Text-to-video with diffusers:

import torch
from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoModel.from_pretrained("Skywork/SkyReels-V2-DF-14B-540P-Diffusers", subfolder="vae", torch_dtype=torch.float32)

pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "Skywork/SkyReels-V2-DF-14B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16
)
flow_shift = 8.0  # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline = pipeline.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipeline(
    prompt=prompt,
    num_inference_steps=30,
    height=544,  # 720 for 720P
    width=960,   # 1280 for 720P
    num_frames=97,
    base_num_frames=97,  # 121 for 720P
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    causal_block_size=5,  # Number of frames in each block for asynchronous processing
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long video generations
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)

Image-to-video with diffusers:

import numpy as np
import torch
import torchvision.transforms.functional as TF
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

model_id = "Skywork/SkyReels-V2-DF-14B-720P-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
flow_shift = 5.0  # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline.to("cuda")

first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

def aspect_ratio_resize(image, pipeline, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

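def center_crop_resize(image, height, width):
    # Scale factor that makes the image cover the target height and width
    resize_ratio = max(width / image.width, height / image.height)

    # Resize so both dimensions are at least as large as the target
    resized_width = round(image.width * resize_ratio)
    resized_height = round(image.height * resize_ratio)
    image = image.resize((resized_width, resized_height))

    # Crop the center to the target size; center_crop expects [height, width]
    image = TF.center_crop(image, [height, width])

    return image, height, width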

first_frame, height, width = aspect_ratio_resize(first_frame, pipeline)
if last_frame.size != first_frame.size:
    last_frame, _, _ = center_crop_resize(last_frame, height, width)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

output = pipeline(
    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=8)
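
To see what `aspect_ratio_resize` does to the frame size, the arithmetic can be run without loading a pipeline. This is a minimal sketch: `target_size` is a hypothetical name, and `mod_value=16` is an assumed value standing in for `pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]`, which depends on the loaded checkpoint.

```python
import math


def target_size(image_width, image_height, max_area=720 * 1280, mod_value=16):
    """Mirror the arithmetic of aspect_ratio_resize without a pipeline.

    mod_value=16 is an assumption; the real value comes from the loaded
    pipeline's VAE scale factor and transformer patch size.
    """
    aspect_ratio = image_height / image_width
    # Keep the area near max_area, then round down to a multiple of mod_value.
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


print(target_size(1280, 720))  # (720, 1280)
print(target_size(1024, 768))  # (816, 1104)
```

A 16:9 input maps exactly onto 720 × 1280, while a 4:3 input is rounded down to the nearest multiples of the assumed `mod_value`, so the output aspect ratio can differ slightly from the input.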

Note:

  • If you want to run the image-to-video (I2V) task, add --image ${image_path} to your command. It is also better to use a text-to-video (T2V)-style prompt that includes some description of the first-frame image.
  • For long video generation, you can just switch the --num_frames, e.g

Core symbols most depended-on inside this repo

  • to · called by 106 · skyreels_v2_infer/modules/vae.py
  • encode · called by 12 · skyreels_v2_infer/modules/vae.py
  • _sigma_to_alpha_sigma_t · called by 8 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
  • set_timesteps · called by 7 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
  • flash_attention · called by 7 · skyreels_v2_infer/modules/attention.py
  • step · called by 6 · skyreels_v2_infer/scheduler/fm_solvers_unipc.py
  • fp16_clamp · called by 5 · skyreels_v2_infer/modules/t5.py
  • decode · called by 5 · skyreels_v2_infer/modules/vae.py

Shape

Method 152
Class 53
Function 45

Languages

Python 100%

Modules by API surface

skyreels_v2_infer/modules/transformer.py · 45 symbols
skyreels_v2_infer/modules/vae.py · 40 symbols
skyreels_v2_infer/modules/t5.py · 37 symbols
skyreels_v2_infer/modules/clip.py · 32 symbols
skyreels_v2_infer/scheduler/fm_solvers_unipc.py · 19 symbols
skyreels_v2_infer/modules/xlm_roberta.py · 10 symbols
skyreels_v2_infer/pipelines/diffusion_forcing_pipeline.py · 9 symbols
skycaptioner_v1/scripts/vllm_struct_caption.py · 8 symbols
skyreels_v2_infer/modules/tokenizers.py · 7 symbols
skyreels_v2_infer/distributed/xdit_context_parallel.py · 7 symbols
skycaptioner_v1/scripts/vllm_fusion_caption.py · 6 symbols
skyreels_v2_infer/modules/__init__.py · 5 symbols

Dependencies from manifests, versioned

accelerate 1.6.0 · 1×
decord 0.6.0 · 1×
diffusers 0.31.0 · 1×
numpy 1.23.5 · 1×
opencv-python 4.10.0.84 · 1×
tokenizers 0.21.1 · 1×
torch 2.5.1 · 1×
torchvision 0.20.1 · 1×
transformers 4.49.0 · 1×
vllm 0.8.4 · 1×

For agents

$ claude mcp add SkyReels-V2 \
  -- python -m otcore.mcp_server <graph>
