
github.com/Tongyi-MAI/Z-Image @main sqlite

126 symbols 468 edges 18 files 18 documented · 14%

README

⚡️- Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Welcome to the official repository for the Z-Image (造相) project!

✨ Z-Image

Z-Image is a powerful and highly efficient image generation model family with 6B parameters. Currently there are four variants:

🚀 Z-Image-Turbo – A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️sub-second inference latency⚡️ on enterprise-grade H800 GPUs and fits comfortably on consumer devices with 16 GB of VRAM. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
🎨 Z-Image – The foundation model behind Z-Image-Turbo. Z-Image focuses on high-quality generation, rich aesthetics, strong diversity, and controllability, well-suited for creative generation, fine-tuning, and downstream development. It supports a wide range of artistic styles, effective negative prompting, and high diversity across identities, poses, compositions, and layouts.
🧱 Z-Image-Omni-Base – The versatile foundation model capable of both generation and editing tasks. By releasing this checkpoint, we aim to unlock the full potential for community-driven fine-tuning and custom development, providing the most "raw" and diverse starting point for the open-source community.
✍️ Z-Image-Edit – A variant fine-tuned on Z-Image specifically for image editing tasks. It supports creative image-to-image generation with impressive instruction-following capabilities, allowing for precise edits based on natural language prompts.

📣 News

[2026-01-27] 🔥 Z-Image is released! We have released the model checkpoint on Hugging Face and ModelScope. Try our online demo!
[2025-12-08] 🏆 Z-Image-Turbo ranked 8th overall on the Artificial Analysis Text-to-Image Leaderboard, making it the 🥇 #1 open-source model! Check out the full leaderboard.
[2025-12-01] 🎉 Our technical report for Z-Image is now available on arXiv.
[2025-11-26] 🔥 Z-Image-Turbo is released! We have released the model checkpoint on Hugging Face and ModelScope. Try our online demo!

📥 Model Zoo

| Model | Pre-Training | SFT | RL | Steps | CFG | Task | Visual Quality | Diversity | Fine-Tunability | Hugging Face | ModelScope |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Z-Image-Omni-Base | ✅ | ❌ | ❌ | 50 | ✅ | Gen. / Editing | Medium | High | Easy | To be released | To be released |
| Z-Image | ✅ | ✅ | ❌ | 50 | ✅ | Gen. | High | Medium | Easy | | |
| Z-Image-Turbo | ✅ | ✅ | ✅ | 8 | ❌ | Gen. | Very High | Low | N/A | | |
| Z-Image-Edit | ✅ | ✅ | ❌ | 50 | ✅ | Editing | High | Medium | Easy | To be released | To be released |

The figure below illustrates at which training stage each model is produced.

Training Pipeline of Z-Image

🖼️ Showcase

📸 Photorealistic Quality: Z-Image-Turbo delivers strong photorealistic image generation while maintaining excellent aesthetic quality.

Showcase of Z-Image on Photo-realistic image Generation

📖 Accurate Bilingual Text Rendering: Z-Image-Turbo excels at accurately rendering complex Chinese and English text.

Showcase of Z-Image on Bilingual Text Rendering

💡 Prompt Enhancing & Reasoning: Prompt Enhancer empowers the model with reasoning capabilities, enabling it to transcend surface-level descriptions and tap into underlying world knowledge.

🧠 Creative Image Editing: Z-Image-Edit shows a strong understanding of bilingual editing instructions, enabling imaginative and flexible image transformations.

Showcase of Z-Image-Edit on Image Editing

🏗️ Model Architecture

We adopt a Scalable Single-Stream DiT (S3-DiT) architecture. In this setup, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
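As a rough illustration of what "concatenated at the sequence level" means, the sketch below builds one unified token stream from three modality segments and records where each segment sits. It uses plain Python lists in place of tensors. The function name `build_single_stream`, the segment labels, and the token counts are hypothetical and are not taken from the Z-Image codebase.

```python
from typing import Dict, List, Sequence, Tuple

Token = List[float]  # stand-in for one embedding vector


def build_single_stream(
    segments: Sequence[Tuple[str, List[Token]]],
) -> Tuple[List[Token], Dict[str, Tuple[int, int]]]:
    """Concatenate per-modality token lists into one sequence.

    Returns the unified sequence and a map from segment name to its
    half-open [start, end) span inside that sequence.
    """
    stream: List[Token] = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name, tokens in segments:
        start = len(stream)
        stream.extend(tokens)
        spans[name] = (start, len(stream))
    return stream, spans


# Hypothetical sizes: 4 text tokens, 2 semantic tokens, 6 image latent tokens.
# All share one embedding width so they can live in the same sequence.
dim = 8
text = [[0.0] * dim for _ in range(4)]
semantic = [[1.0] * dim for _ in range(2)]
image = [[2.0] * dim for _ in range(6)]

stream, spans = build_single_stream(
    [("text", text), ("visual_semantic", semantic), ("image_vae", image)]
)
print(len(stream), spans)
```

Because every transformer block in a single-stream design sees this one sequence, the same weights process all three token types, whereas dual-stream designs typically keep separate weights per modality.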

Architecture of Z-Image and Z-Image-Edit

📈 Performance

Z-Image-Turbo has been evaluated on multiple independent benchmarks, where it consistently delivers state-of-the-art results and ranks as the leading open-source model.

Artificial Analysis Text-to-Image Leaderboard

On the highly competitive Artificial Analysis Leaderboard, Z-Image-Turbo ranked 8th overall and secured the top position as the 🥇 #1 Open-Source Model, outperforming all other open-source alternatives.

Artificial Analysis Leaderboard

Artificial Analysis Leaderboard (Open-Source Model Only)

Alibaba AI Arena Text-to-Image Leaderboard

According to the Elo-based Human Preference Evaluation on Alibaba AI Arena, Z-Image-Turbo also achieves state-of-the-art results among open-source models and shows highly competitive performance against leading proprietary models.
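For readers unfamiliar with Elo-based preference rankings, here is a minimal sketch of the standard Elo update applied to one pairwise comparison. The K-factor of 32 and the scale of 400 are conventional defaults assumed for illustration; the settings Alibaba AI Arena actually uses are not stated in this document.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; a rater prefers model A's image.
a, b = update_elo(1000.0, 1000.0, score_a=1.0)
print(a, b)
```

A win against an equally rated opponent moves each rating by half the K-factor, while a win against a much weaker opponent moves it very little.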

Alibaba AI Arena Text-to-Image Leaderboard

🚀 Quick Start

(1) PyTorch Native Inference

Create a virtual environment of your choice, then install the dependencies:

pip install -e .

Then run the following code to generate an image:

python inference.py

(2) Diffusers Inference

Install the latest version of diffusers with the command below.

Why install from source: we submitted two pull requests (#12703 and #12715) to the 🤗 diffusers repository to add support for Z-Image, and both have been merged upstream. Installing diffusers from source ensures you get the latest features and Z-Image support:

pip install git+https://github.com/huggingface/diffusers

Z-Image-Turbo

Then, try the following code to generate an image:

import torch
from diffusers import ZImagePipeline

# 1. Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
# pipe.transformer.set_attention_backend("flash")    # Enable Flash-Attention-2
# pipe.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3

# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
# pipe.transformer.compile()

# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
# pipe.enable_model_cpu_offload()

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."

# 2. Generate Image
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("example.png")

Z-Image

Recommended Parameters:

- Resolution: 512×512 to 2048×2048 (total pixel area, any aspect ratio)
- Guidance scale: 3.0 – 5.0
- Inference steps: 28 – 50
- Negative prompts: Strongly recommended for better control
- CFG normalization: False for general stylization, True for realism
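The recommended ranges above can be checked before launching a run. The sketch below is a hypothetical helper (`ZImageSettings` and `check_settings` are not part of the repository or of diffusers). It reads "total pixel area" as height × width lying between 512² and 2048², and it does not check any divisibility constraint the pipeline may impose on height and width.

```python
from dataclasses import dataclass
from typing import List

MIN_PIXELS = 512 * 512      # lower bound on height * width
MAX_PIXELS = 2048 * 2048    # upper bound on height * width


@dataclass
class ZImageSettings:
    height: int
    width: int
    guidance_scale: float
    num_inference_steps: int


def check_settings(s: ZImageSettings) -> List[str]:
    """Return warnings for values outside the recommended ranges (empty if none)."""
    warnings: List[str] = []
    pixels = s.height * s.width
    if not MIN_PIXELS <= pixels <= MAX_PIXELS:
        warnings.append(f"pixel area {pixels} outside [{MIN_PIXELS}, {MAX_PIXELS}]")
    if not 3.0 <= s.guidance_scale <= 5.0:
        warnings.append(f"guidance_scale {s.guidance_scale} outside [3.0, 5.0]")
    if not 28 <= s.num_inference_steps <= 50:
        warnings.append(f"num_inference_steps {s.num_inference_steps} outside [28, 50]")
    return warnings


# A 720x1280 portrait at guidance 4.0 and 50 steps is within all three ranges.
print(check_settings(ZImageSettings(1280, 720, 4.0, 50)))
# Turbo-style settings (no guidance, 9 steps) fall outside the Z-Image ranges.
print(check_settings(ZImageSettings(1024, 1024, 0.0, 9)))
```

These ranges apply to the Z-Image foundation model only; Z-Image-Turbo is run with `guidance_scale=0.0` and far fewer steps.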

Then, try the following code to generate an image:

```python
import torch
from diffusers import ZImagePipeline

# Load the pipeline
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# Generate image
# The prompt is in Chinese: two young Asian women stand close together against
# a plain gray textured wall, both smiling at the camera.
prompt = "两名年轻亚裔女性紧密站在一起，背景为朴素的灰色纹理墙面，可能是室内地毯地面。左侧女性留着长卷发，身穿藏青色毛衣，左袖有奶油色褶皱装饰，内搭白色立领衬衫，下身白色裤子；佩戴小巧金色耳钉，双臂交叉于背后。右侧女性留直肩长发，身穿奶油色卫衣，胸前印有\"Tun the tables\"字样，下方为\"New ideas\"，搭配白色裤子；佩戴银色小环耳环，双臂交叉于胸前。两人均面带微笑直视镜头。照…"
```

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
| --- | --- | --- |
| get | 40 | src/zimage/scheduler.py |
| create_coordinate_grid | 3 | src/zimage/transformer.py |
| load_config | 3 | src/utils/loader.py |
| _native_attention_wrapper | 3 | src/utils/attention.py |
| _sigma_to_t | 2 | src/zimage/scheduler.py |
| generate | 2 | src/zimage/pipeline.py |

Shape

Method 56

Function 46

Class 24

Languages

Python 100%

Modules by API surface

src/zimage/autoencoder.py: 37 symbols

src/zimage/transformer.py: 30 symbols

src/utils/attention.py: 23 symbols

src/zimage/scheduler.py: 13 symbols

src/utils/helpers.py: 6 symbols

batch_inference.py: 4 symbols

src/zimage/pipeline.py: 3 symbols

src/utils/loader.py: 3 symbols

src/utils/import_utils.py: 3 symbols

src/tools/generate_manifest.py: 3 symbols

inference.py: 1 symbol

Dependencies from manifests, versioned

accelerate · 1×

huggingface_hub 0.25.0 · 1×

loguru · 1×

pillow · 1×

safetensors · 1×

torch 2.5.0 · 1×

transformers 4.51.0 · 1×
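To see whether a local environment lines up with the versioned entries above, a small check along these lines may help. It is a sketch that treats the listed versions as minimums, which is an assumption: the manifest may pin them differently. The helper names are hypothetical, and the parser compares only the leading numeric release segment, so pre-release suffixes are ignored.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions listed in the manifest, treated here as minimums (an assumption).
REQUIRED = {"huggingface_hub": "0.25.0", "torch": "2.5.0", "transformers": "4.51.0"}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release part, e.g. '2.5.0+cu121' -> (2, 5, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_installed(required: Dict[str, str]) -> Dict[str, Optional[bool]]:
    """Map each package to True/False (meets the minimum) or None (not installed)."""
    status: Dict[str, Optional[bool]] = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = None
            continue
        status[name] = version_tuple(installed) >= version_tuple(minimum)
    return status


print(check_installed(REQUIRED))
```

A `None` entry means the package is missing entirely, which is worth distinguishing from an installed but outdated version.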

For agents

$ claude mcp add Z-Image \
  -- python -m otcore.mcp_server <graph>
