github.com/showlab/Show-o @main sqlite

910 symbols 3,251 edges 69 files 260 documented · 29%
README

One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie1*  Weijia Mao1*  Zechen Bai1*  David Junhao Zhang1* 

Weihao Wang2  Kevin Qinghong Lin1  Yuchao Gu1 Zhijie Chen2  Zhenheng Yang2  Mike Zheng Shou1

1 Show Lab, National University of Singapore  2 Bytedance 


Improved Native Unified Multimodal Models

Jinheng Xie1  Zhenheng Yang2  Mike Zheng Shou1

1 Show Lab, National University of Singapore  2 Bytedance 


News

  • [2025-09-18] Arxiv update to include video understanding, OneIG, and more ablation study results.
  • [2025-09-18] Show-o2 has been accepted to NeurIPS 2025.
  • [2025-09-05] Release the 1.5B and 7B models with video understanding capability.
  • [2025-07-05] Fix some issues related to visualization of generated images during training.
  • [2025-07-05] We release the training and inference code for simple mixed-modality generation.
  • [2025-06-27] We release the training code for multimodal understanding and generation.
  • [2025-06-25] We thank team OneIG-Bench for evaluating Show-o2 models on their new benchmark, in which our models have achieved leading performance in terms of Alignment and Reasoning metrics. The leaderboard is maintained here.

  • [2025-06-20] We are including more concurrent works in our comparative analysis tables. Feel free to reach out to us if we miss your works.

  • [2025-06-19] We release the Show-o2 models with 1.5B and 7B LLM parameters for multimodal understanding and generation. We perform the unified learning of multimodal understanding and generation on the text token and 3D Causal VAE space, which is scalable for text, image, and video modalities. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video and mixed-modality generation.

  • [2025-01-23] Show-o has been accepted to ICLR 2025.
  • [2024-10-15] Update Arxiv paper to include new features and experimental results.
      • Support image generation in a resolution of 512x512.
      • Improve the multimodal understanding capabilities of purely discrete Show-o.
      • Improve the performance on the GenEval benchmark.
      • Explore the impact of dataset scale and image resolution on multimodal understanding capabilities of discrete image tokens. For more information, please refer to the paper.
      • We release the weights of Show-o before fine-tuning on LLaVA instructional tuning datasets. You can fine-tune it following the configurations in ./configs.

  • [2024-09-12] Arxiv paper updated to include preliminaries about discrete diffusion.

  • [2024-09-03] We deploy an online demo on Hugging Face Space. 🤗 Have fun!
  • [2024-09-02] We release the training code for pre-training and instruction tuning! 🔥🔥
  • [2024-09-01] Add FlexAttention implementation for acceleration. Thanks to @Horace for providing examples.
  • [2024-08-28] We maintain a repo of Awesome Unified Multimodal Models. If you are interested in unified models, star and watch it to get the latest updates!
  • [2024-08-27] Add integration to Hugging Face! Thanks to @NielsRogge.
  • [2024-08-26] We built two community platforms to facilitate discussion, requests, and collaboration! Reach us on Discord and WeChat!
  • [2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation including image captioning, visual question answering (VQA), text-to-image generation, text-guided inpainting and extrapolation.

What is new about Show-o?

Below is a characteristics comparison among understanding only, generation only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.

Below is an overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
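The mixed attention pattern described above (causal attention for text tokens, full attention among the tokens of one image) can be illustrated with a small mask builder. This is a minimal sketch, not the repository's implementation: the function name `sketch_omni_mask` and the `(kind, length)` segment format are hypothetical and chosen only for illustration.

```python
def sketch_omni_mask(segments):
    """Build a boolean attention mask for a mixed text/image sequence.

    `segments` is a list of (kind, length) pairs, where kind is "text" or
    "image". mask[i][j] is True when position i may attend to position j.
    Hypothetical helper for illustration only.
    """
    # Expand the segments into one (segment_id, kind) entry per position.
    positions = []
    for seg_id, (kind, length) in enumerate(segments):
        positions.extend([(seg_id, kind)] * length)

    n = len(positions)
    mask = [[False] * n for _ in range(n)]
    for i, (seg_i, kind_i) in enumerate(positions):
        for j, (seg_j, _) in enumerate(positions):
            if j <= i:
                # Causal part: every position sees itself and the past.
                mask[i][j] = True
            elif kind_i == "image" and seg_i == seg_j:
                # Full attention: image tokens also see later tokens
                # that belong to the same image block.
                mask[i][j] = True
    return mask


# Two text tokens, a three-token image block, then one text token.
mask = sketch_omni_mask([("text", 2), ("image", 3), ("text", 1)])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch, text rows form the usual lower-triangular pattern, while the rows of the image block form a dense square on the diagonal.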

TODO

  • [X] Release the inference code.
  • [X] Release the training code.
  • [X] Support image generation in a resolution of 512x512.
  • [ ] Scale up the model size (based on LLaMA3) and increase the number of training data.

Hugging Face models and annotations

The Show-o2 checkpoints can be found on Hugging Face:

  • showlab/show-o2-1.5B
  • showlab/show-o2-1.5B-HQ (text-to-image generation in resolutions of 512x512 and 1024x1024 with better text rendering)
  • showlab/show-o2-7B
  • showlab/show-o2-1.5B (further unified fine-tuning on video understanding data)
  • showlab/show-o2-7B (further unified fine-tuning on video understanding data)

The Show-o checkpoints can be found on Hugging Face:

  • showlab/show-o-512x512
  • showlab/show-o-w-clip-vit-512x512
  • showlab/show-o-512x512-wo-llava-tuning
  • showlab/show-o
  • showlab/show-o-w-clip-vit
  • showlab/magvitv2
  • Journeydb-Annotation

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

Log in to your wandb account on your machine or server.

wandb login <your wandb keys>

Inference demo for Multimodal Understanding. You can view the results on wandb.

option (c)

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'

or option (a)

python3 inference_mmu.py config=configs/showo_demo_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'

Inference demo for Text-to-Image Generation. You can view the results (in a resolution of 512x512) on wandb.

python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
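The commands in this section pass their settings as `key=value` arguments after the script name. As a rough illustration of that convention, the sketch below parses such arguments and overlays them on a base configuration. It is an assumption-laden example, not the parser the scripts actually use: `parse_overrides` and `apply_overrides` are hypothetical names, nested keys are not handled, and all values are kept as strings.

```python
def parse_overrides(argv):
    """Turn ["batch_size=1", "mode=t2i"] into {"batch_size": "1", "mode": "t2i"}.

    Hypothetical helper: values are kept as strings and keys are flat.
    """
    overrides = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides


def apply_overrides(base, overrides):
    """Return a copy of `base` in which command-line overrides take precedence."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# The base values here are made up for the example.
base = {"batch_size": "8", "guidance_scale": "1.75"}
cli = parse_overrides(["batch_size=1", "guidance_scale=5", "mode=t2i"])
print(apply_overrides(base, cli))
```

Note that the shell removes the quotes in arguments such as `mode='t2i'` before the script sees them, so the parsed value is `t2i`.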

Inference demo for Text-guided Inpainting. You can view the results (in a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Inference demo for Text-guided Extrapolation. You can view the results (in a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
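Several arguments in the commands above (`question`, `prompt`, `extra_direction`) pack multiple entries into one string separated by ` *** `. The sketch below shows one way such strings could be split and paired. It assumes the scripts treat ` *** ` as a plain separator and consume the entries in order; the helper names are hypothetical and not taken from the repository.

```python
SEPARATOR = " *** "  # separator seen in the command-line examples


def split_batched(value):
    """Split a ' *** '-separated argument into a list of trimmed entries."""
    return [part.strip() for part in value.split(SEPARATOR)]


def pair_directions_with_prompts(extra_direction, prompt):
    """Pair each extrapolation direction with the prompt at the same index."""
    directions = split_batched(extra_direction)
    prompts = split_batched(prompt)
    if len(directions) != len(prompts):
        raise ValueError(
            f"{len(directions)} directions but {len(prompts)} prompts"
        )
    return list(zip(directions, prompts))


steps = pair_directions_with_prompts(
    "left *** left *** right",
    "a clear, blue lake *** a clear, blue lake *** a clear, blue lake",
)
for direction, text in steps:
    print(direction, "->", text)
```

Under this reading, the extrapolation example passes six directions and six prompts, one prompt per extrapolation step.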

Training pipeline

Prepare your training data and change the data path in configs/xx.yaml.

Note that our training process is based on accelerate. Please make sure to configure accelerate for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.

├── accelerate_configs/
│   ├── multi_nodes (6x8 GPUs)
│   │   ├── ...
│   ├── 1_gpu.yaml
│   └── 8_gpu_deepspeed_zero2.yaml

Stage 1 - Pre-training on ImageNet-1K dataset. Change the data path to ImageNet-1K in configs/showo_pretraining_stage1.yaml. **Note that we use internal packages to process the RefinedWeb dataset, and you must manually comment out the code part related to language modeling

Core symbols most depended-on inside this repo

  • to · called by 389 · models/training_utils.py
  • batch_decode · called by 26 · show-o2/models/wan21_vae.py
  • from_pretrained · called by 25 · models/training_utils.py
  • omni_attn_mask_naive · called by 23 · show-o2/models/omni_attention.py
  • from_pretrained · called by 17 · show-o2/models/modeling_utils.py
  • load_state_dict · called by 16 · models/training_utils.py
  • sample · called by 15 · show-o2/models/wan21_vae.py
  • update · called by 14 · show-o2/utils.py

Shape

Method 499
Function 267
Class 141
Route 3

Languages

Python 100%

Modules by API surface

show-o2/models/modeling_siglip.py · 65 symbols
show-o2/models/qwen2.py · 59 symbols
models/phi.py · 59 symbols
show-o2/models/modules.py · 49 symbols
show-o2/models/wan21_vae.py · 41 symbols
models/common_modules.py · 37 symbols
show-o2/models/my_logging.py · 36 symbols
models/logging.py · 36 symbols
show-o2/models/modeling_utils.py · 32 symbols
models/modeling_utils.py · 32 symbols
show-o2/transport/transport.py · 31 symbols
show-o2/transport/path.py · 26 symbols

Dependencies from manifests, versioned

Deprecated 1.2.14 · 1×
GitPython 3.1.43 · 1×
Jinja2 3.1.4 · 1×
MarkupSafe 2.1.5 · 1×
PasteDeploy 3.1.0 · 1×
PyJWT 2.8.0 · 1×
PyYAML 6.0.1 · 1×
Pygments 2.18.0 · 1×
SQLAlchemy 2.0.30 · 1×
Shapely 1.8.5.post1 · 1×
WTForms 3.1.2 · 1×
WebOb 1.8.7 · 1×

For agents

$ claude mcp add Show-o \
  -- python -m otcore.mcp_server <graph>
