

[2026.05.15] Release SenseNova-U1-8B-MoT-Infographic 📊 model for improved infographic generation. See U1 Infographic Model for details, and ✨ Infographic Showcases for 100 generated examples.
[2026.05.10] Release 🔥SenseNova-U1 Technical Report🔥 and the weights for SenseNova-U1-A3B-MoT-SFT & SenseNova-U1-A3B-MoT.
[2026.05.08] Add GGUF quantized checkpoints and layer-offload VRAM modes for low-VRAM single-GPU inference. See Memory-efficient inference. GGUF weights for SenseNova-U1-8B-MoT-Merger are available at 🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf — many thanks to @smthem for contributing the quantized weights.
[2026.05.06] Release SenseNova-U1-8B-MoT-LoRA-8step-V1.0. Please see the example script.
[2026.04.30] Release the preview version of the 8-step inference model SenseNova-U1-8B-MoT-8step-preview. In most cases, the image generation quality of this model closely matches that of the base model (see comparison and existing issues). To test this model, run the inference scripts with the following parameters: `--cfg_scale 1.0 --num_steps 8`.
[2026.04.27] Initial release of the weights for SenseNova-U1-8B-MoT-SFT and SenseNova-U1-8B-MoT.
[2026.04.27] Initial release of the inference code for SenseNova-U1.
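The `--cfg_scale 1.0 --num_steps 8` settings listed for the 8-step preview model can be read through the standard classifier-free guidance formula. The sketch below is an illustration under that assumption, not the repository's sampler: `guided_prediction` and `forward_passes` are hypothetical helper names, and the toy vectors stand in for model outputs. It shows that a guidance scale of 1.0 reduces to the conditional prediction alone, so a sampler can skip the unconditional pass.

```python
def guided_prediction(cond, uncond, cfg_scale):
    """Standard classifier-free guidance (assumed formulation):
    start from the unconditional prediction and move toward the
    conditional one by a factor of cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(num_steps, cfg_scale):
    """Count model evaluations for a sampler that skips the
    unconditional branch when cfg_scale is exactly 1.0 (assumption)."""
    passes_per_step = 1 if cfg_scale == 1.0 else 2
    return num_steps * passes_per_step


# Toy stand-ins for the conditional and unconditional model outputs.
cond = [0.5, -1.0, 2.0]
uncond = [0.25, 0.5, -1.0]

# With cfg_scale = 1.0 the guided output equals the conditional output.
print(guided_prediction(cond, uncond, 1.0))  # [0.5, -1.0, 2.0]

# Eight steps without an unconditional branch need eight evaluations.
print(forward_passes(num_steps=8, cfg_scale=1.0))  # 8
```

Under these assumptions, the 8-step setting costs one model evaluation per step, whereas any guidance scale other than 1.0 would cost two.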
🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.
Unifying visual understanding and generation in an end-to-end architecture from pixel to word opens tremendous possibilities, enabling highly efficient and strong understanding, generation, and interleaved reasoning in a natively multimodal manner.

At the core of SenseNova U1 is NEO-unify, a novel architecture designed from first principles for multimodal AI. It eliminates both the Visual Encoder (VE) and the Variational Auto-Encoder (VAE), so that pixel and word information is inherently and deeply correlated. Several important features are as follows:
Powered by this new core architecture, SenseNova U1 delivers exceptional efficiency in multimodal learning:

Left: Generation latency vs. average performance on OneIG (EN, ZH), LongText (EN, ZH), BizGenEval (Easy, Hard), CVTG, and IGenBench.
Right: Generation latency vs. average performance on infographic benchmarks, i.e., BizGenEval (Easy, Hard) and IGenBench.
🏆 Open-source SoTA in both understanding and generation: SenseNova U1 sets a new standard for unified multimodal understanding and generation, achieving state-of-the-art performance among open-source models across a wide range of understanding, reasoning, and generation benchmarks.
📖 Native interleaved image-text generation: SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.
📰 High-density information rendering: SenseNova U1 demonstrates strong capabilities in dense visual communication, generating richly structured layouts for knowledge illustrations, posters, presentations, comics, resumes, and other information-rich formats.
In this release, we are open-sourcing the SenseNova U1 Lite series in two sizes:
| Model | Params | HF Weights |
|---|---|---|
| SenseNova-U1-8B-MoT-Infographic | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-SFT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B | 🤗 link |
| SenseNova-U1-A3B-MoT-SFT | A3B MoT | 🤗 link |
| SenseNova-U1-A3B-MoT | A3B MoT | 🤗 link |
Here SFT models (×32 downsampling ratio) are trained via Understanding Warmup, Generation Pre-training, Unified Mid-training, and Unified SFT, with final models obtained after an initial round of T2I RL training.
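As a rough illustration of what a ×32 downsampling ratio implies, the sketch below assumes the ratio applies to each spatial dimension, so that one visual token covers a 32×32 pixel patch. The helper name `token_grid` and the example resolutions are hypothetical and not taken from the repository.

```python
def token_grid(height: int, width: int, ratio: int = 32) -> tuple[int, int]:
    """Return the (rows, cols) token grid for an image, assuming each
    spatial dimension is downsampled by `ratio` (an assumption)."""
    if height % ratio or width % ratio:
        raise ValueError("height and width must be multiples of the ratio")
    return height // ratio, width // ratio


# Hypothetical 2048x2048 image: 64 x 64 grid, i.e. 4096 visual tokens.
rows, cols = token_grid(2048, 2048)
print(rows, cols, rows * cols)  # 64 64 4096
```

Under this reading, doubling the image side length quadruples the number of visual tokens.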
Although relatively compact by today’s standards, these models already show strong performance across diverse tasks, comparable to commercial models with excellent cost efficiency. Notably, larger-scale versions are planned to further enhance capability and performance in the future.
💡 The `8B-MoT` in `SenseNova-U1-8B-MoT` refers to ~8B understanding parameters and ~8B generation parameters. See parameter breakdown for details.
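To make the parameter note concrete, here is a back-of-the-envelope estimate of weight memory. It assumes the ~8B understanding and ~8B generation parameters are stored separately (about 16B weights in total) and uses nominal bytes per parameter. Real checkpoints, especially quantized ones, carry extra overhead such as scales, activations, and caches, so treat the numbers as lower bounds. `weight_memory_gib` is a hypothetical helper.

```python
# Nominal storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Weights-only memory in GiB for the given parameter count."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: ~8B understanding + ~8B generation parameters, not shared.
total_params = 8e9 + 8e9

for dtype in ("bf16", "int8", "4-bit"):
    print(f"{dtype}: {weight_memory_gib(total_params, dtype):.1f} GiB")
# bf16: 29.8 GiB
# int8: 14.9 GiB
# 4-bit: 7.5 GiB
```

This kind of estimate suggests why quantized checkpoints and layer offloading matter for single-GPU inference.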
[ ] Training code of SenseNova-U1
[x] Final weights and technical report of SenseNova-U1
🖼️ Text-to-Image (General)
[Gallery: 15 generated sample images]
🖼️ Text-to-Image (Reasoning)
| Original Text | Reasoning Process | Resulting Image |
|---|---|---|
| A male peacock trying to attract a female | 1. Instruction Understanding: The core subject is a single male peacock. The action involves the male peacock trying to attract a female. 2. Reasoning Process: Male peacocks perform this ritual by fanning their tail feathers into a massive, colorful circular display. 3. Establish the frame: The frame captures the male peacock in a full-body shot to show the entire display. The angle is eye-level to emphasize the height of the tail feathers. 4. Set the lighting and color: Natural lighting illuminates the colorful circular display with soft shadows. The palette features vibrant blues and greens on the tail feathers. 5. Lock the style: The style is a realistic photograph with sharp focus on the male peacock. The finish is natural without artistic filters. 6. Explicit Prompt: A vibrant male peacock with his tail feathers fully fanned out in a wide, iridescent display of blues and greens, standing in a garden. | ![]() |
| A small piece of dry wood and a dense iron block are in a transparent water tank | 1. Instruction Understanding: The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom. 2. Reasoning Process: The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom. 3. Establish the frame: The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct. 4. Build the environment: The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tan |