github.com/showlab/Show-o @main sqlite

910 symbols · 3,251 edges · 69 files · 260 documented (29%)

README

One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie¹* Weijia Mao¹* Zechen Bai¹* David Junhao Zhang¹*

Weihao Wang² Kevin Qinghong Lin¹ Yuchao Gu¹ Zhijie Chen² Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² Bytedance

Improved Native Unified Multimodal Models

Jinheng Xie¹ Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² Bytedance

News

[2025-09-18] Arxiv update to include video understanding, OneIG, and more ablation study results.
[2025-09-18] Show-o2 has been accepted to NeurIPS 2025.
[2025-09-05] Release the 1.5B and 7B models with video understanding capability.
[2025-07-05] Fix some issues related to visualization of generated images during training.
[2025-07-05] We release the training and inference code for simple mixed-modality generation.
[2025-06-27] We release the training code for multimodal understanding and generation.
[2025-06-25] We thank team OneIG-Bench for evaluating Show-o2 models on their new benchmark, in which our models have achieved leading performance in terms of Alignment and Reasoning metrics. The leaderboard is maintained here.

[2025-06-20] We are including more concurrent works in our comparative analysis tables. Feel free to reach out to us if we miss your works.
[2025-06-19] We release the Show-o2 models with 1.5B and 7B LLM parameters for multimodal understanding and generation. We perform the unified learning of multimodal understanding and generation on the text token and 3D Causal VAE space, which is scalable for text, image, and video modalities. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video and mixed-modality generation.

[2025-01-23] Show-o has been accepted to ICLR 2025.
[2024-10-15] Update Arxiv paper to include new features and experimental results.
* Support image generation in a resolution of 512x512.
* Improve the multimodal understanding capabilities of purely discrete Show-o.
* Improve the performance on the GenEval benchmark.
* Explore the impact of dataset scale and image resolution on multimodal understanding capabilities of discrete image tokens. For more information, please refer to the paper.
* We release the weights of Show-o before fine-tuning on LLaVA instructional tuning datasets. You can fine-tune it following the configurations in ./configs.
[2024-09-12] Arxiv paper updated to include preliminaries about discrete diffusion.
[2024-09-03] We deploy an online demo on Hugging Face Space. 🤗 Have fun!
[2024-09-02] We release the training code for pre-training and instruction tuning! 🔥🔥
[2024-09-01] Add FlexAttention implementation for acceleration. Thanks to @Horace for providing examples.
[2024-08-28] We maintain a repo of Awesome Unified Multimodal Models. If you are interested in unified models, star and watch it to get the latest updates!
[2024-08-27] Add integration to Hugging Face! Thanks to @NielsRogge.
[2024-08-26] We build two community platforms to facilitate discussion, request and collaboration! Reach us with Discord and WeChat!
[2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation including image captioning, visual question answering (VQA), text-to-image generation, text-guided inpainting and extrapolation.

What is new about Show-o?

Below is a comparison of the characteristics of understanding-only, generation-only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.

Below is an overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
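
The mixed attention pattern described above (causal attention for text tokens, full attention among image tokens) can be illustrated with a small mask builder. This is a minimal sketch, not the repository's implementation: the function name `build_omni_attention_mask`, the "text"/"image" labels, and the rule that image tokens see every earlier token plus all tokens of their own image block are assumptions made for illustration.

```python
def build_omni_attention_mask(modalities):
    """Return mask[i][j] == True when position i may attend to position j.

    `modalities` is a list of "text" / "image" labels, one per token.
    Assumed rule: every token attends causally (j <= i); image tokens
    additionally attend to all tokens of the same contiguous image block.
    """
    n = len(modalities)

    # Give each contiguous run of image tokens its own block id; text gets None.
    block_ids = []
    current = -1
    for idx, kind in enumerate(modalities):
        if kind == "image":
            if idx == 0 or modalities[idx - 1] != "image":
                current += 1
            block_ids.append(current)
        else:
            block_ids.append(None)

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_image = block_ids[i] is not None and block_ids[i] == block_ids[j]
            mask[i][j] = causal or same_image
    return mask


if __name__ == "__main__":
    seq = ["text", "text", "image", "image", "image", "text"]
    for row in build_omni_attention_mask(seq):
        print("".join("x" if allowed else "." for allowed in row))
```

Printing the mask for a short sequence makes the pattern visible: text rows form a lower triangle, while the rows of an image block are filled across the whole block.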

TODO

[X] Release the inference code.
[X] Release the training code.
[X] Support image generation in a resolution of 512x512.
[ ] Scale up the model size (based on LLaMA3) and increase the number of training data.

Hugging Face models and annotations

The Show-o2 checkpoints can be found on Hugging Face:
* showlab/show-o2-1.5B
* showlab/show-o2-1.5B-HQ (text-to-image generation in resolutions of 512x512 and 1024x1024 with better text rendering)
* showlab/show-o2-7B
* showlab/show-o2-1.5B (further unified fine-tuning on video understanding data)
* showlab/show-o2-7B (further unified fine-tuning on video understanding data)

The Show-o checkpoints can be found on Hugging Face:
* showlab/show-o-512x512
* showlab/show-o-w-clip-vit-512x512
* showlab/show-o-512x512-wo-llava-tuning
* showlab/show-o
* showlab/show-o-w-clip-vit
* showlab/magvitv2
* Journeydb-Annotation

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

wandb login <your wandb keys>

Inference demo for Multimodal Understanding. You can view the results on wandb.

option (c)

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'

or option (a)

python3 inference_mmu.py config=configs/showo_demo_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
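
The commands above pass settings as `key=value` arguments and pack several questions into one string separated by ` *** `. The sketch below shows how such arguments could be parsed with the standard library only. It is an illustration under assumptions: `parse_overrides` and `split_questions` are hypothetical helpers, not functions from inference_mmu.py, and all values are kept as strings here, whereas a real config loader would typically also convert types.

```python
def parse_overrides(argv):
    """Turn ['key=value', ...] into a dict, splitting on the first '=' only."""
    overrides = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides


def split_questions(question, delimiter=" *** "):
    """Split a multi-question string into individual, stripped questions."""
    return [q.strip() for q in question.split(delimiter) if q.strip()]


if __name__ == "__main__":
    # The shell strips the quotes, so the question arrives as one argument.
    argv = [
        "config=configs/showo_demo_512x512.yaml",
        "max_new_tokens=100",
        "mmu_image_root=./mmu_validation",
        "question=Please describe this image in detail. *** "
        "Do you think the image is unusual or not?",
    ]
    cfg = parse_overrides(argv)
    for q in split_questions(cfg["question"]):
        print(q)
```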

Inference demo for Text-to-Image Generation. You can view the results (at a resolution of 512x512) on wandb.

python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
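
To give a feel for what `guidance_scale` and `generation_timesteps` typically control in masked-token image generation, here is a minimal sketch of two common ingredients: a cosine schedule that decides how many image tokens remain masked after each step, and classifier-free guidance applied to logits. Both are assumptions for illustration (a MaskGIT-style scheme and one common guidance formula), not code taken from inference_t2i.py; the function names and the 1024-token count are hypothetical.

```python
import math


def masked_fraction(step, total_steps):
    """Cosine schedule: fraction of image tokens still masked after `step` steps."""
    return math.cos(0.5 * math.pi * step / total_steps)


def tokens_to_keep_masked(step, total_steps, num_tokens):
    """Number of tokens left masked after `step` of `total_steps` steps."""
    return int(math.floor(masked_fraction(step, total_steps) * num_tokens))


def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance: move conditional logits away from unconditional ones.

    Uses the (1 + w) * cond - w * uncond form; other conventions exist.
    """
    return [
        (1.0 + guidance_scale) * c - guidance_scale * u
        for c, u in zip(cond_logits, uncond_logits)
    ]


if __name__ == "__main__":
    total_steps = 50   # matches generation_timesteps=50 in the command above
    num_tokens = 1024  # hypothetical token count for one image
    for step in (0, 10, 25, 40, 50):
        print(step, tokens_to_keep_masked(step, total_steps, num_tokens))
    print(apply_guidance([1.0, 2.0], [0.5, 0.5], guidance_scale=5))
```

With this schedule, few tokens are committed in the early steps and most are filled in toward the end; a guidance scale of 0 leaves the conditional logits unchanged.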

Inference demo for Text-guided Inpainting. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Inference demo for Text-guided Extrapolation. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
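
In the extrapolation command above, `extra_direction` and `prompt` each hold six entries separated by ` *** `, which suggests that the i-th direction is paired with the i-th prompt. The sketch below checks that pairing before any generation would run. It is a hypothetical helper under assumptions: `plan_extrapolation` is not a function from the repository, and the set of accepted directions is a guess (only left and right appear in the command).

```python
# Assumed set of directions; the command above only uses "left" and "right".
VALID_DIRECTIONS = {"left", "right", "up", "down"}


def plan_extrapolation(extra_direction, prompt, delimiter=" *** "):
    """Pair each direction with its prompt and validate both lists."""
    directions = [d.strip() for d in extra_direction.split(delimiter)]
    prompts = [p.strip() for p in prompt.split(delimiter)]
    if len(directions) != len(prompts):
        raise ValueError(
            f"{len(directions)} directions but {len(prompts)} prompts"
        )
    for d in directions:
        if d not in VALID_DIRECTIONS:
            raise ValueError(f"unknown direction: {d!r}")
    return list(zip(directions, prompts))


if __name__ == "__main__":
    lake = "a serene natural landscape featuring a clear, blue lake."
    steps = plan_extrapolation(
        "left *** left *** right",
        " *** ".join([lake] * 3),
    )
    for number, (direction, text) in enumerate(steps, start=1):
        print(number, direction, text)
```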

Training pipeline

Prepare your training data and change the data path in configs/xx.yaml.

Note that our training process is based on accelerate. Please make sure to configure accelerate for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.

├── accelerate_configs/
│   ├── multi_nodes (6x8 GPUs)
│   │   ├── ...
│   ├── 1_gpu.yaml
│   └── 8_gpu_deepspeed_zero2.yaml
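
One of the accelerate configs listed above can be handed to `accelerate launch` through its `--config_file` option. The sketch below only assembles and prints such a command, so it can be inspected before running. It rests on assumptions: `build_launch_command` is a hypothetical helper, and the training script path `training/train.py` is a placeholder that should be replaced with the script you actually use.

```python
import shlex


def build_launch_command(accelerate_config, train_script, overrides):
    """Assemble an `accelerate launch` argv list with key=value overrides."""
    cmd = ["accelerate", "launch", "--config_file", accelerate_config, train_script]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


if __name__ == "__main__":
    cmd = build_launch_command(
        "accelerate_configs/1_gpu.yaml",
        "training/train.py",  # hypothetical script path, adjust as needed
        {"config": "configs/showo_pretraining_stage1.yaml"},
    )
    # Print a shell-safe version of the command instead of executing it.
    print(shlex.join(cmd))
```

To actually start training, the returned list could be passed to `subprocess.run(cmd, check=True)` once the paths are confirmed.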

Stage 1 - Pre-training on ImageNet-1K dataset. Change the data path to ImageNet-1K in configs/showo_pretraining_stage1.yaml. Note that we use internal packages to process the RefinedWeb dataset, so you must manually comment out the code part related to language modeling

Core symbols most depended-on inside this repo

called by 389

models/training_utils.py

batch_decode

called by 26

show-o2/models/wan21_vae.py

from_pretrained

called by 25

models/training_utils.py

omni_attn_mask_naive

called by 23

show-o2/models/omni_attention.py

from_pretrained

called by 17

show-o2/models/modeling_utils.py

load_state_dict

called by 16

models/training_utils.py

sample

called by 15

show-o2/models/wan21_vae.py

update

called by 14

show-o2/utils.py

Shape

Method 499

Function 267

Class 141

Route 3

Languages

Python 100%

Modules by API surface

show-o2/models/modeling_siglip.py · 65 symbols

show-o2/models/qwen2.py · 59 symbols

models/phi.py · 59 symbols

show-o2/models/modules.py · 49 symbols

show-o2/models/wan21_vae.py · 41 symbols

models/common_modules.py · 37 symbols

show-o2/models/my_logging.py · 36 symbols

models/logging.py · 36 symbols

show-o2/models/modeling_utils.py · 32 symbols

models/modeling_utils.py · 32 symbols

show-o2/transport/transport.py · 31 symbols

show-o2/transport/path.py · 26 symbols

Dependencies from manifests, versioned

Deprecated 1.2.14 · 1×

GitPython 3.1.43 · 1×

Jinja2 3.1.4 · 1×

MarkupSafe 2.1.5 · 1×

PasteDeploy 3.1.0 · 1×

PyJWT 2.8.0 · 1×

PyYAML 6.0.1 · 1×

Pygments 2.18.0 · 1×

SQLAlchemy 2.0.30 · 1×

Shapely 1.8.5.post1 · 1×

WTForms 3.1.2 · 1×

WebOb 1.8.7 · 1×

For agents

$ claude mcp add Show-o \
  -- python -m otcore.mcp_server <graph>
