
中文 | English
Note: This project is released under the Apache 2.0 license and is completely free. "2 hours" refers to the measured time for running 1 epoch of the SFT stage on a single NVIDIA 3090, while "RMB 3" refers to the corresponding GPU rental cost.

🔗 Online Demo | 🔗 Video Introduction
The emergence of Large Language Models (LLMs) has drawn unprecedented global attention to AI. ChatGPT, DeepSeek, Qwen, and many other models have impressed people with their remarkable performance, making the impact of this technological wave feel very real. However, models with tens or hundreds of billions of parameters are not only difficult to train on personal devices but often out of reach even for deployment. Opening the "black box" of large models and truly understanding how they work internally should be exciting. Unfortunately, most explorations stop at applying techniques such as LoRA to fine-tune existing large models on a few new instructions or specific tasks. This is like teaching Newton how to use a 21st-century smartphone: interesting, but far from the original goal of understanding the essence of physics.
At the same time, third-party LLM frameworks and toolkits such as transformers / trl / peft often expose only highly abstract interfaces. With just a dozen lines of code, one can run the entire "load model + load dataset + inference + reinforcement learning" training pipeline. This kind of encapsulation is convenient, but it also separates developers from the underlying implementation and reduces the opportunity to understand the core code of LLMs in depth. I believe that "building an airplane from Lego bricks yourself is far more exciting than flying in first class". A more practical problem is that the internet is filled with paid courses and marketing content in which so-called AI tutorials are wrapped in flawed, half-understood explanations. For this reason, this project aims to lower the learning barrier of LLMs as much as possible, so that everyone can start from understanding every line of code and train a tiny language model by hand from scratch. Yes, training from scratch, not merely staying at the inference level. With a server cost of less than RMB 3, you can personally experience the full process of building a language model from 0 to 1.
😊 Let's share the joy of creation together!
- Qwen3 / Qwen3-MoE ecosystem.
- `<tool_call>`, `<tool_response>`, `<think>`, etc.
- transformers, trl, peft, as well as commonly used inference engines like llama.cpp, vllm, ollama, and training frameworks like Llama-Factory.
- reasoning_content, tool_calls, and open_thinking.

| Model | Parameters | Release |
|---|---|---|
| minimind-3 | 64M | 2026.04.01 |
| minimind-3-moe | 198M-A64M | 2026.04.01 |
| minimind2-small | 26M | 2025.04.26 |
| minimind2-moe | 145M | 2025.04.26 |
| minimind2 | 104M | 2025.04.26 |
| minimind-v1-small | 26M | 2024.08.28 |
| minimind-v1-moe | 4×26M | 2024.09.17 |
| minimind-v1 | 108M | 2024.09.01 |
🔥 2026-04-01
- minimind-3 / minimind-3-moe: comprehensive updates to structure, Tokenizer, training pipeline, inference interface, and default configuration
- Qwen3 / Qwen3-MoE ecosystem: Dense approximately 64M, MoE approximately 198M-A64M, and removed shared expert design
- pretrain_t2t(_mini).jsonl, sft_t2t(_mini).jsonl, rlaif.jsonl, agent_rl.jsonl, and agent_rl_math.jsonl
- train_reason.py; thinking capability is now unified through chat_template + `<think>` and open_thinking adaptive switch control
- toolcall capability has been merged into sft_t2t / sft_t2t_mini main branch data, default full_sft already has basic Tool Call capability; also added inference examples such as scripts/chat_api.py
- Agentic RL training script train_agent.py, supporting GRPO / CISPO in multi-turn Tool-Use scenarios
- rollout engine decoupling, supporting more flexible switching of generation backends
- serve_openai_api.py and web_demo.py added reasoning_content / tool_calls / open_thinking support
- BPE + ByteLevel, with new tool call and thinking tokens, reserved buffer tokens for future extension
- scripts/convert_model.py

2025-10-24
- `<tool_call>`, `<think>`, etc.)

2025-04-26
- `<s></s>` -> `<|im_start|><|im_end|>`

To be compatible with third-party inference frameworks llama.cpp, vllm, this update comes with some considerable costs.
Models released before 2025-04-26 can no longer be loaded "directly" for inference after this update.
Because Llama's positional encoding differs from minimind's, the QK values differ after the weights are mapped to the Llama model.
The old minimind2 series models were all recovered through weight mapping plus (fine-tuning) calibration of the QKVO linear layers.
After this update, the entire `minimind-v1` series will no longer be maintained and will be removed from the repository.
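
As a rough illustration of what the switch from `<s></s>` to `<|im_start|><|im_end|>` means for prompt text, the sketch below renders a conversation in the common ChatML-style layout. The function name `render_chatml` and the exact placement of newlines are assumptions for illustration; the project's actual chat_template may differ.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as ChatML-style text.

    Hypothetical helper: the layout follows the widely used ChatML
    convention, not necessarily the project's own chat_template.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Hello"}])
print(prompt)
```

Inference frameworks that use this layout typically rely on `<|im_end|>` to detect the end of a turn, which is one reason matching the delimiters matters for compatibility.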
2025-02-09
- Major update since release: released the minimind2 series.
- Code almost entirely refactored into a more concise, unified structure.
  For compatibility with old code, visit 🔗Old Repository Content🔗.
- Removed the data preprocessing steps. Unified the dataset format and switched to jsonl to avoid confusion when downloading datasets.
- The minimind2 series performs significantly better than MiniMind-V1.
- Minor fixes: more standard kv-cache implementation, MoE load-balancing loss now taken into account, etc.
- Provides a training recipe for adapting models to private datasets (medical model and self-awareness examples).
- Streamlined the pretraining dataset and significantly improved its quality, greatly reducing the time needed for a quick individual training run; reproducible in 2 hours on a single 3090!
- Updated: LoRA fine-tuning decoupled from the peft wrapper, with the LoRA process implemented from scratch; DPO implemented natively from scratch in PyTorch; white-box model distillation implemented natively.
- Released the minimind2-DeepSeek-R1 series of distilled models!
- minimind2 now has some English-language capability!
- Updated benchmark results comparing minimind2 with third-party models on more LLM leaderboards.
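
The list above mentions a DPO implementation written from scratch. As a hedged illustration of the objective only (not the project's code, which uses PyTorch), the per-pair DPO loss can be sketched in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for this example; the inputs are summed log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    All arguments are summed sequence log-probabilities; beta scales
    how strongly the policy is pushed away from the reference model.
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form is numerically stable.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When policy and reference agree, the loss is log(2); it drops as the
# policy assigns relatively more probability to the chosen response.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

In a tensor-based implementation the same expression would be applied to a batch and averaged, with gradients flowing only through the policy log-probabilities.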
2024-10-05
- Extended MiniMind with multimodal (vision) capability. See the sibling project minimind-v for details!

2024-09-27
- Updated the pretrain dataset preprocessing method: to preserve text integrity, data is no longer preprocessed into .bin format for training (at a slight cost in training speed).
- The preprocessed pretrain file is currently named pretrain_data.csv.
- Removed some redundant code.

2024-09-17
- Updated the minimind-v1-moe model.
- To prevent ambiguity, mistral_tokenizer is no longer used for tokenization; the custom minimind_tokenizer is used throughout.

2024-09-01
- Updated the minimind-v1 (108M) model.
$ claude mcp add minimind \
-- python -m otcore.mcp_server <graph>