github.com/OpenSenseNova/SenseNova-U1 @comfyui-v0.1.4 sqlite

857 symbols · 2,813 edges · 76 files · 264 documented (31%)
README

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture


📣 Updated News

🌟 Overview

🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.

Unifying visual understanding and generation in an end-to-end architecture from pixel to word opens tremendous possibilities, enabling highly efficient and strong understanding, generation, and interleaved reasoning in a natively multimodal manner.

[Figure: radar plot]

🏗️ Key Pillars:

At the core of SenseNova U1 is NEO-unify, a novel architecture designed from first principles for multimodal AI. It eliminates both the Visual Encoder (VE) and the Variational Auto-Encoder (VAE), treating pixel and word information as inherently and deeply correlated. Its key features are:

  • 🔗 Model language and visual information end-to-end as a unified compound.
  • 🖼️ Preserve semantic richness while maintaining pixel-level visual fidelity.
  • 🧠 Reason across modalities with high efficiency & minimal conflict via native MoTs.

What This Unlocks:

Powered by this new core architecture, SenseNova U1 delivers exceptional efficiency in multimodal learning:

Left: Generation latency vs. average performance on OneIG (EN, ZH), LongText (EN, ZH), BizGenEval (Easy, Hard), CVTG, and IGenBench.

Right: Generation latency vs. average performance on infographic benchmarks, i.e., BizGenEval (Easy, Hard) and IGenBench.

  • 🏆 Open-source SoTA in both understanding and generation: SenseNova U1 sets a new standard for unified multimodal understanding and generation, achieving state-of-the-art performance among open-source models across a wide range of understanding, reasoning, and generation benchmarks.

  • 📖 Native interleaved image-text generation: SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.

  • 📰 High-density information rendering: SenseNova U1 demonstrates strong capabilities in dense visual communication, generating richly structured layouts for knowledge illustrations, posters, presentations, comics, resumes, and other information-rich formats.

🌍 Beyond Multimodality:

  • 🤖 Vision–Language–Action (VLA)
  • 🌐 World Modeling (WM)

🦁 Models

In this release, we are open-sourcing the SenseNova U1 Lite series in two sizes:

  • SenseNova U1-8B-MoT — dense backbone
  • SenseNova U1-A3B-MoT — MoE backbone

| Model | Params | HF Weights |
|---|---|---|
| SenseNova-U1-8B-MoT-Infographic | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-SFT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B | 🤗 link |
| SenseNova-U1-A3B-MoT-SFT | A3B MoT | 🤗 link |
| SenseNova-U1-A3B-MoT | A3B MoT | 🤗 link |

Here, the SFT models (×32 downsampling ratio) are trained in four stages: Understanding Warmup, Generation Pre-training, Unified Mid-training, and Unified SFT. The final models are obtained after an initial round of T2I RL training.

Although relatively compact by today’s standards, these models already show strong performance across diverse tasks, comparable to commercial models with excellent cost efficiency. Notably, larger-scale versions are planned to further enhance capability and performance in the future.

💡 The 8B-MoT in SenseNova-U1-8B-MoT refers to ~8B understanding parameters and ~8B generation parameters. See parameter breakdown for details.
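As a rough illustration of what the note above implies for hardware sizing, the sketch below adds the two parameter groups and converts the total to weight memory. It rests on two assumptions that the README does not state: that the understanding and generation parameters are disjoint (the linked parameter breakdown is the authority here), and that weights are stored in a 2-byte format such as bfloat16. The names `weight_memory_gib` and `BYTES_PER_PARAM` are made up for this example.

```python
# Rough weight-memory estimate for an "8B-MoT" checkpoint.
# Assumptions (not from the README): the two ~8B parameter groups do not
# overlap, and each parameter is stored in 2 bytes (e.g. bfloat16).

UNDERSTANDING_PARAMS = 8e9  # ~8B understanding parameters, per the note above
GENERATION_PARAMS = 8e9     # ~8B generation parameters, per the note above
BYTES_PER_PARAM = 2         # assumed storage width


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


total_params = UNDERSTANDING_PARAMS + GENERATION_PARAMS
print(f"total parameters: ~{total_params / 1e9:.0f}B")
print(f"weights only:     ~{weight_memory_gib(total_params):.1f} GiB")
```

The estimate covers weights only; activations, KV cache, and any offloading strategy change the actual footprint.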

📋 ToDo List

  • [ ] Training code of SenseNova-U1

  • [x] Final weights and technical report of SenseNova-U1

🎨 Showcases

🖼️ Text-to-Image (General)

[Image gallery: 15 general text-to-image samples, covering face HD, text rendering, landscape, and artistic portrait categories.]

🖼️ Text-to-Image (Reasoning)

Columns: Original Text · Reasoning Process · Resulting Image

Original Text: A male peacock trying to attract a female
Reasoning Process:
  1. Instruction Understanding: The core subject is a single male peacock. The action involves the male peacock trying to attract a female.
  2. Reasoning Process: Male peacocks perform this ritual by fanning their tail feathers into a massive, colorful circular display.
  3. Establish the frame: The frame captures the male peacock in a full-body shot to show the entire display. The angle is eye-level to emphasize the height of the tail feathers.
  4. Set the lighting and color: Natural lighting illuminates the colorful circular display with soft shadows. The palette features vibrant blues and greens on the tail feathers.
  5. Lock the style: The style is a realistic photograph with sharp focus on the male peacock. The finish is natural without artistic filters.
  6. Explicit Prompt: A vibrant male peacock with his tail feathers fully fanned out in a wide, iridescent display of blues and greens, standing in a garden.
Resulting Image: [image]

Original Text: A small piece of dry wood and a dense iron block are in a transparent water tank
Reasoning Process:
  1. Instruction Understanding: The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom.
  2. Reasoning Process: The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom.
  3. Establish the frame: The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct.
  4. Build the environment: The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tan
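The reasoning traces in this showcase follow a visible pattern: numbered steps with a short title and a colon, ending in an "Explicit Prompt" step. The sketch below shows one way to split such a trace into steps and pull out the final prompt. The helper `split_reasoning` and the regular expression are hypothetical and inferred only from the two rows shown here; they are not part of the repository and may not match every trace the model produces.

```python
import re

# Matches step headers such as "1. Instruction Understanding:".
# Assumption: a header is an integer, a period, a short title that contains
# no period or colon, then a colon. Inferred from the showcase rows only.
STEP_HEADER = re.compile(r"(\d+)\.\s+([^.:]+):\s*")


def split_reasoning(trace: str) -> list[tuple[int, str, str]]:
    """Split a reasoning trace into (step number, title, body) tuples."""
    matches = list(STEP_HEADER.finditer(trace))
    steps = []
    for i, m in enumerate(matches):
        # A step body runs until the next header, or to the end of the trace.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(trace)
        steps.append((int(m.group(1)), m.group(2).strip(), trace[m.end():end].strip()))
    return steps


# The peacock trace from the showcase, copied verbatim.
PEACOCK_TRACE = (
    "1. Instruction Understanding: The core subject is a single male peacock. "
    "The action involves the male peacock trying to attract a female. "
    "2. Reasoning Process: Male peacocks perform this ritual by fanning their tail "
    "feathers into a massive, colorful circular display. "
    "3. Establish the frame: The frame captures the male peacock in a full-body shot "
    "to show the entire display. The angle is eye-level to emphasize the height of "
    "the tail feathers. "
    "4. Set the lighting and color: Natural lighting illuminates the colorful circular "
    "display with soft shadows. The palette features vibrant blues and greens on the "
    "tail feathers. "
    "5. Lock the style: The style is a realistic photograph with sharp focus on the "
    "male peacock. The finish is natural without artistic filters. "
    "6. Explicit Prompt: A vibrant male peacock with his tail feathers fully fanned "
    "out in a wide, iridescent display of blues and greens, standing in a garden."
)

steps = split_reasoning(PEACOCK_TRACE)
explicit_prompt = next(body for _, title, body in steps if title == "Explicit Prompt")
print(explicit_prompt)
```

A truncated trace, such as the second row above, still splits into the steps that are present, but the lookup for "Explicit Prompt" would fail, so callers may want to handle that case.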

Core symbols most depended-on inside this repo

  • info · called by 23 · apps/comfyui/local_pipeline.py
  • tqdm · called by 19 · examples/vqa/inference.py
  • write · called by 19 · evaluation/gen/tiif/eval/summary_dimension_results.py
  • _t2i_predict_v · called by 17 · src/sensenova_u1/models/neo_unify/modeling_neo_chat.py
  • append_message · called by 16 · src/sensenova_u1/models/neo_unify/conversation.py
  • from_pretrained · called by 15 · src/sensenova_u1/models/neo_unify/configuration_neo_vit.py
  • _build_t2i_image_indexes · called by 13 · src/sensenova_u1/models/neo_unify/modeling_neo_chat.py
  • prepare_flash_kv_cache · called by 11 · src/sensenova_u1/models/neo_unify/modeling_neo_chat.py

Shape

Function 497
Method 269
Class 87
Route 4
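The per-kind counts above can be cross-checked against the summary line near the top of the page. The short sketch below sums them and derives the documentation share, assuming the 31% figure means documented symbols divided by total symbols; the page itself does not define it.

```python
# Symbol counts by kind, as listed under "Shape".
shape = {"Function": 497, "Method": 269, "Class": 87, "Route": 4}

total_symbols = sum(shape.values())
documented = 264  # "264 documented" from the summary line

# Assumption: coverage = documented symbols / total symbols.
coverage = documented / total_symbols

print(total_symbols)       # 857, matching "857 symbols"
print(f"{coverage:.0%}")   # 31%
```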

Languages

Python 99%
TypeScript 1%

Modules by API surface

src/sensenova_u1/models/neo_unify/modeling_fm_modules.py · 58 symbols
evaluation/interleave/OpenING/infer_opening.py · 45 symbols
src/sensenova_u1/models/neo_unify/modeling_qwen3.py · 44 symbols
src/sensenova_u1/models/neo_unify/modeling_neo_chat.py · 39 symbols
apps/comfyui/nodes.py · 38 symbols
apps/comfyui/local_pipeline.py · 37 symbols
evaluation/interleave/Unimmmu/inference_unimmmu.py · 30 symbols
evaluation/interleave/Realunify/inference_realunify.py · 29 symbols
src/sensenova_u1/utils/layer_offload.py · 28 symbols
evaluation/interleave/BabyVision/infer_babyvision.py · 27 symbols
evaluation/interleave/OpenING/eval_opening.py · 24 symbols
src/sensenova_u1/utils/profiler.py · 20 symbols

Dependencies from manifests, versioned

accelerate 1.10.1 · 1×
httpx
huggingface-hub 0.36.2 · 1×
packaging 25.0 · 1×
pillow 12.0.0 · 1×
pre-commit 4.5.1 · 1×
safetensors 0.6.2 · 1×
sentencepiece 0.2.1 · 1×
tokenizers 0.22.1 · 1×
torch 2.8.0 · 1×
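To compare a local environment against the versions listed above, a minimal sketch using only the standard library might look like the following. It assumes the listed versions are exact pins, which the listing does not confirm (they could equally be minimums). The helper `check_pins` is hypothetical and not part of the repository.

```python
from importlib import metadata

# Versions copied from the manifest listing; httpx appears without a version.
PINS = {
    "accelerate": "1.10.1",
    "httpx": None,
    "huggingface-hub": "0.36.2",
    "packaging": "25.0",
    "pillow": "12.0.0",
    "pre-commit": "4.5.1",
    "safetensors": "0.6.2",
    "sentencepiece": "0.2.1",
    "tokenizers": "0.22.1",
    "torch": "2.8.0",
}


def check_pins(pins: dict) -> dict:
    """Map each package name to 'ok', 'missing', or 'mismatch (installed X)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop a local build suffix such as "+cu128" before comparing.
        base = installed.split("+")[0]
        if wanted is None or base == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {installed})"
    return report


for package, status in check_pins(PINS).items():
    print(f"{package:16s} {status}")
```

Plain string equality is used here on purpose to avoid extra dependencies; a real check against version ranges would need a proper version parser.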

For agents

$ claude mcp add SenseNova-U1 \
  -- python -m otcore.mcp_server <graph>
