
github.com/jingyaogong/minimind @main sqlite

166 symbols · 822 edges · 22 files · 1 documented (1%)

README

"The Great Way is Simple"


This open-source project aims to train MiniMind, an ultra-small language model with about 64M parameters, entirely from scratch with only about RMB 3 in cost and 2 hours of training time.
The MiniMind series is intentionally lightweight. The smallest model on the main branch is about $\frac{1}{2700}$ the size of GPT-3, making full training and reproduction feasible even on ordinary personal GPUs.
The project provides a minimalist model architecture and an end-to-end LLM training pipeline, covering MoE, data cleaning, pretraining, Supervised Fine-Tuning (SFT), LoRA, RLHF (DPO), RLAIF (PPO / GRPO / CISPO), Tool Use, Agentic RL, Adaptive Thinking, and Model Distillation.
MiniMind has also been extended to a vision model (MiniMind-V), a multimodal omni model (MiniMind-O), a diffusion language model (MiniMind-dLM), and a linear attention model (MiniMind-Linear). See Discussion for details.
All core algorithms are implemented directly in native PyTorch, without relying on high-level abstractions from third-party libraries.
MiniMind is both an end-to-end open-source reproduction of the LLM training pipeline and a hands-on tutorial for learning how LLMs are built.
We hope this project can provide a reproducible, understandable, and extensible starting point for more people, share the joy of creation, and help move the broader AI community forward.
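
The size comparison with GPT-3 above can be checked with quick arithmetic. This sketch assumes the commonly cited 175B parameter count for GPT-3, which the README does not state, and the "about 64M" figure given for the smallest main-branch model.

```python
# Rough size comparison between GPT-3 and the smallest main-branch MiniMind model.
GPT3_PARAMS = 175e9      # assumption: the commonly cited 175B figure for GPT-3
MINIMIND_PARAMS = 64e6   # "about 64M" per the README

ratio = GPT3_PARAMS / MINIMIND_PARAMS
print(f"GPT-3 is roughly {ratio:.0f}x larger than MiniMind")
```

Under these assumptions the ratio lands a little above 2700, consistent with the rounded fraction quoted above.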

Note: This project is released under the Apache 2.0 license and is completely free. "2 hours" refers to the measured time for running 1 epoch of the SFT stage on a single NVIDIA 3090, while "RMB 3" refers to the corresponding GPU rental cost.


🔗 Online Demo | 🔗 Video Introduction

📌 Project Introduction

The emergence of Large Language Models (LLMs) has drawn unprecedented global attention to AI. ChatGPT, DeepSeek, Qwen, and many other models have impressed people with their remarkable performance, making the impact of this technological wave feel very real. However, models with tens or hundreds of billions of parameters are not only difficult to train on personal devices but often out of reach even for deployment. Opening the "black box" of large models and truly understanding how they work internally ought to be exciting. Unfortunately, most explorations stop at applying techniques such as LoRA to fine-tune an existing large model on a few new instructions or specific tasks. That is like teaching Newton to use a 21st-century smartphone: interesting, but far from the original goal of understanding the essence of physics.

At the same time, third-party LLM frameworks and toolkits such as transformers / trl / peft often expose only highly abstract interfaces. A dozen lines of code are enough to run the entire "load model + load dataset + inference + reinforcement learning" pipeline. This kind of encapsulation is convenient, but it also separates developers from the underlying implementation and reduces the opportunity to understand the core code of LLMs in depth. I believe that "building an airplane from Lego bricks yourself is far more exciting than flying in first class". A more practical problem is that the internet is filled with paid courses and marketing content in which so-called AI tutorials are wrapped in flawed, half-understood explanations. For this reason, the original intention of this project is to lower the barrier to learning LLMs as much as possible, so that everyone can start by understanding every line of code and train a tiny language model by hand from scratch. Yes, training from scratch, not merely staying at the inference level. With a server cost of less than RMB 3, you can personally experience the full process of building a language model from 0 to 1.

😊 Let's share the joy of creation together!

🎉 This Project Includes the Following

Provides the full MiniMind-LLM architecture implementation (Dense + MoE), aligned with the Qwen3 / Qwen3-MoE ecosystem.
Provides the tokenizer and tokenizer training code, supporting template tokens such as <tool_call>, <tool_response>, <think>, etc.
Covers end-to-end training pipelines including pretraining, SFT, LoRA, RLHF-DPO, RLAIF (PPO / GRPO / CISPO), Tool Use, Agentic RL, Adaptive Thinking, and Model Distillation.
Provides open-source data for all stages, covering collected, distilled, cleaned, and deduplicated high-quality datasets.
Key training algorithms and core modules are all implemented from scratch, without relying on third-party framework wrappers.
Compatible with mainstream frameworks such as transformers, trl, peft, as well as commonly used inference engines like llama.cpp, vllm, ollama, and training frameworks like Llama-Factory.
Supports single-node single-GPU and single-node multi-GPU training (DDP, DeepSpeed), wandb / swanlab visualization, and dynamic training pause/resume.
Supports evaluation on third-party benchmark suites such as C-Eval, C-MMLU, OpenBookQA, etc., and supports RoPE long context extrapolation through YaRN.
Provides a lightweight OpenAI-compatible API server for integration with third-party Chat UIs such as FastGPT and Open-WebUI, with support for reasoning_content, tool_calls, and open_thinking.
Provides a minimalist chat WebUI based on Streamlit, supporting thinking display, tool selection, and multi-turn Tool Call.
Includes experimental extensions: diffusion language model (dLM) and linear attention model (Linear Attention), both of which can be further trained from the main autoregressive model.
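
The feature list mentions `<think>` template tokens and a `reasoning_content` field in the OpenAI-compatible API. One way such a field could be populated is sketched below. The helper name `split_reasoning` and its exact behaviour are hypothetical and are not taken from MiniMind's server code.

```python
import re

# Hypothetical helper: the name and behaviour are illustrative, not MiniMind's code.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(generated: str) -> dict:
    """Separate the first <think>...</think> span from the visible answer."""
    match = THINK_PATTERN.search(generated)
    if match is None:
        # No thinking span: everything is regular content.
        return {"reasoning_content": None, "content": generated.strip()}
    reasoning = match.group(1).strip()
    # Remove the matched span and keep whatever surrounds it as the answer.
    answer = (generated[:match.start()] + generated[match.end():]).strip()
    return {"reasoning_content": reasoning, "content": answer}

message = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(message)
```

Keeping the reasoning in its own field lets a chat UI fold or hide it without having to parse tags itself.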

🎉 Released Model List

Model	Parameters	Release
minimind-3	64M	2026.04.01
minimind-3-moe	198M-A64M	2026.04.01
minimind2-small	26M	2025.04.26
minimind2-moe	145M	2025.04.26
minimind2	104M	2025.04.26
minimind-v1-small	26M	2024.08.28
minimind-v1-moe	4×26M	2024.09.17
minimind-v1	108M	2024.09.01

📝 Changelog

🔥 2026-04-01

Released minimind-3 / minimind-3-moe: comprehensive updates to structure, Tokenizer, training pipeline, inference interface, and default configuration
Main branch structure aligned with Qwen3 / Qwen3-MoE ecosystem: Dense approximately 64M, MoE approximately 198M-A64M, and removed shared expert design
Default training data switched to pretrain_t2t(_mini).jsonl, sft_t2t(_mini).jsonl, rlaif.jsonl, agent_rl.jsonl, and agent_rl_math.jsonl
Removed standalone train_reason.py; thinking capability is now unified through chat_template + <think> and open_thinking adaptive switch control
Tool Call capability has been merged into the main sft_t2t / sft_t2t_mini data, so the default full_sft already has basic Tool Call capability; inference examples such as scripts/chat_api.py were also added
Added native Agentic RL training script train_agent.py, supporting GRPO / CISPO in multi-turn Tool-Use scenarios
Decoupled the rollout engine from the RLAIF / Agentic RL training pipeline, allowing more flexible switching of generation backends
serve_openai_api.py and web_demo.py added reasoning_content / tool_calls / open_thinking support
Tokenizer updated (BPE + ByteLevel), adding tool call and thinking tokens plus reserved buffer tokens for future extension
Added a LoRA weight merging and export pipeline: scripts/convert_model.py can merge the base model and LoRA weights into a new set of complete model weights
Structure diagram resources updated, README extensively updated
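
The LoRA weight merging mentioned in this entry can be illustrated with the standard LoRA formulation, W' = W + scale · (B · A). The sketch below uses plain Python lists rather than PyTorch tensors to stay dependency-free. The names `matmul`, `merge_lora`, and `scale` are hypothetical, and what scripts/convert_model.py actually does may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example: a 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]    # shape (rank, in_features)
lora_b = [[2.0], [0.0]]   # shape (out_features, rank)
merged = merge_lora(weight, lora_a, lora_b)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is what makes the result exportable as a single set of weights.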

2025-10-24

🔥 Added RLAIF training algorithms: PPO, GRPO, SPO (natively implemented from scratch)
Added checkpoint resume functionality: supports automatic training recovery, cross-GPU-count recovery, wandb record continuity
Added RLAIF dataset: rlaif-mini.jsonl (randomly sampled 10,000 entries from SFT data); simplified DPO dataset, added Chinese data
Added YaRN algorithm: supports RoPE long context extrapolation, improving long sequence processing capability
Adaptive Thinking: the reasoning model can optionally enable chain-of-thought
chat_template fully supports Tool Calling and Reasoning tags (<tool_call>, <think>, etc.)
Added a complete RLAIF chapter, with training curve comparisons and collapsible explanations of the algorithm principles
SwanLab replaces WandB (easier to access from mainland China, API fully compatible)
Standardized all code & fixed some known bugs
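
GRPO, one of the RLAIF algorithms listed in this entry, scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows. The function name and the `eps` value are hypothetical, and population standard deviation is used here, whereas a given implementation may use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see the note above
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses sampled for one prompt, each with a scalar reward.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.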

2025-04-26

Major update
For compatibility needs, visit 🔗Old Repository Content🔗.
MiniMind model parameters completely renamed, aligned with Transformers library models (unified naming).
generate method refactored, inheriting from GenerationMixin class.
🔥Supports popular third-party ecosystems such as llama.cpp, vllm, ollama.
Standardized code and directory structure.
Changed vocabulary <s></s> -> <|im_start|><|im_end|>
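
The `<|im_start|>` / `<|im_end|>` tokens introduced here are the delimiters used by ChatML-style prompts. The sketch below shows that general layout. The function `render_chatml` is hypothetical, and the actual chat_template shipped with MiniMind's tokenizer may differ, for example in how it handles system prompts, tool calls, or thinking tags.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([{"role": "user", "content": "Hello"}])
print(prompt)
```

Every turn is wrapped in the same pair of special tokens, so the end-of-turn token doubles as the natural stopping point during generation.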

To be compatible with the third-party inference frameworks llama.cpp and vllm, this update comes at a considerable cost.
Old models from before 2025-04-26 can no longer be loaded "directly" for inference.
Because Llama's positional encoding differs from minimind's, the QK values differ after the weights are mapped to the Llama model.
The old minimind2 series models were all recovered through weight mapping plus (fine-tuning) calibration of the QKVO linear layers.
After this update, the entire `minimind-v1` series will no longer be maintained and will be removed from the repository.


2025-02-09

Major update since release: released the minimind2 series.
Code almost entirely refactored into a more concise and clear unified structure. For compatibility needs with old code, visit 🔗Old Repository Content🔗.
Eliminated data preprocessing steps. Unified the dataset format and switched to jsonl to avoid confusion when downloading datasets.
The minimind2 series performs significantly better than MiniMind-V1.
Minor issues addressed: more standard kv-cache implementation, MoE load balancing loss now taken into account, etc.
Provides a training solution for migrating models to private datasets (medical model and self-awareness examples).
Streamlined the pretraining dataset and significantly improved its quality, greatly reducing the time needed for quick individual training: reproducible in 2 hours on a single 3090!
Updated: LoRA fine-tuning decoupled from the peft wrapper, with the LoRA process implemented from scratch; DPO algorithm natively implemented from scratch in PyTorch; white-box model distillation natively implemented.
Released the minimind2-DeepSeek-R1 series of distilled models!
minimind2 has some English language capability!
Updated benchmark results for minimind2 versus third-party models on more LLM leaderboards.
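
The 2025-02-09 entry notes that DPO is implemented natively. The per-pair DPO objective is small enough to sketch on scalar log-probabilities in plain Python. The function name, argument names, and the default `beta` are hypothetical, and the repository's own PyTorch implementation operates on batched tensors and may differ in detail.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in an overflow-safe form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favours the chosen answer
print(neutral, improved)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy raises the chosen response relative to the rejected one.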

2024-10-05

Extended MiniMind with multimodal (vision) capability.
Visit the sibling project minimind-v for details!

2024-09-27

Updated the pretrain dataset preprocessing method: to ensure text integrity, preprocessing into .bin format for training was abandoned (at a slight cost in training speed).
The preprocessed pretrain file is currently named pretrain_data.csv.
Removed some redundant code.

2024-09-17

Updated the minimind-v1-moe model.
To prevent ambiguity, mistral_tokenizer is no longer used for tokenization; the custom minimind_tokenizer is used throughout.

2024-09-01

Updated the minimind-v1 (108M) model.

Core symbols (most depended-on inside this repo)

Symbol	Called by	File(s)
Logger	69	trainer/trainer_utils.py
is_main_process	23	trainer/trainer_utils.py
setup_seed	18	trainer/trainer_utils.py
lm_checkpoint	16	trainer/trainer_utils.py
init_model	13	trainer/trainer_utils.py, trainer/rollout_engine.py
init_distributed_mode	8	trainer/trainer_utils.py

Shape

Function 82

Method 60

Class 23

Route 1

Languages

Python 100%

Modules by API surface

Module	Symbols
model/model_minimind.py	29
dataset/lm_dataset.py	27
trainer/rollout_engine.py	17
trainer/trainer_utils.py	15
scripts/web_demo.py	13
scripts/serve_openai_api.py	11
trainer/train_agent.py	9
scripts/eval_toolcall.py	9
model/model_lora.py	8
trainer/train_ppo.py	6
scripts/convert_model.py	6
trainer/train_tokenizer.py	3

Dependencies (from manifests, versioned)

Flask 3.0.3 · 1×
Flask_Cors 4.0.0 · 1×
datasets 3.6.0 · 1×
datasketch 1.6.4 · 1×
einops 0.8.1 · 1×
jieba 0.42.1 · 1×
jinja2 3.1.2 · 1×
jsonlines 4.0.0 · 1×
marshmallow 3.22.0 · 1×
modelscope 1.37.0 · 1×
ngrok 1.4.0 · 1×
nltk 3.8 · 1×

For agents

$ claude mcp add minimind \
  -- python -m otcore.mcp_server <graph>
