<a href="#news" style="text-decoration: none; font-weight: bold;">🎉 News</a> •
<a href="#links" style="text-decoration: none; font-weight: bold;">🔗 Links</a> •
<a href="#todo" style="text-decoration: none; font-weight: bold;">📝 Roadmap</a> •
<a href="#algorithm-flow" style="text-decoration: none; font-weight: bold;">⚙️ Algorithm Flow</a> •
<a href="#results" style="text-decoration: none; font-weight: bold;">📊 Results</a>
<a href="#getting-started" style="text-decoration: none; font-weight: bold;">✨ Getting Started</a> •
<a href="#training" style="text-decoration: none; font-weight: bold;">🏋️ Training</a> •
<a href="#usage" style="text-decoration: none; font-weight: bold;">🔧 Usage</a> •
<a href="#evaluation-code" style="text-decoration: none; font-weight: bold;">📃 Evaluation</a>
<a href="#citation" style="text-decoration: none; font-weight: bold;">🎈 Citation</a> •
<a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">🌻 Acknowledgement</a> •
<a href="#contact" style="text-decoration: none; font-weight: bold;">📧 Contact</a> •
<a href="#star-history" style="text-decoration: none; font-weight: bold;">📈 Star History</a>

⚠️WARNING⚠️: New Qwen3 base models have untrained token embeddings. We used `python absolute_zero_reasoner/utils/remove_think_qwen3_tokenizer.py --model_name <Qwen3ModelName>` to remove these tokens; otherwise the model produces nonsense.

🚧UNDER TESTING🚧: This new merge to `main` is still under testing. Use the `paper` branch to replicate results from the original paper.

`azr.executor=sandboxfusion` in training configs. Officially completed our initial roadmap.

Use the `paper` branch to reproduce the paper results with a static copy of veRL. The `main` branch will now be regularly updated with the latest veRL versions.

- ✅ Release training code
- ✅ Release evaluation code
- ✅ Update veRL
- ✅ Upgrade Python executor

Our approach repeatedly iterates the following two steps:

1. **PROPOSE**: The model generates reasoning tasks of three types: abduction, deduction, and induction. Tasks are validated with Python execution and assigned a learnability reward.
2. **SOLVE**: The model then attempts to solve these self-generated tasks. Solutions are verified through Python execution and receive an accuracy reward.

The model continuously improves through both phases using TRR++, creating a self-evolving loop that strengthens reasoning without external training data.
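
The two steps can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the repository's implementation: `propose_task`, `noisy_solver`, and the other names are hypothetical stand-ins for the model, the learnability reward is assumed to be zero for tasks that are always or never solved and `1 - mean solve rate` otherwise, and no TRR++ policy update is performed. The `exec`-based executor is unsandboxed and only suitable for this trusted toy input.

```python
import random
from statistics import mean


def run_program(source, func_name, arg):
    """Toy executor: run `source` and call `func_name(arg)`. NOT sandboxed."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](arg)


def propose_task(rng):
    """Hypothetical proposer: stands in for the model writing a (program, input) pair."""
    k = rng.randint(2, 9)
    source = f"def f(x):\n    return x * {k} + 1\n"
    return source, rng.randint(0, 20)


def validate_task(source, arg):
    """Accept a task only if the program runs and gives the same output twice."""
    try:
        first = run_program(source, "f", arg)
        second = run_program(source, "f", arg)
    except Exception:
        return None
    return first if first == second else None


def noisy_solver(source, arg, rng, skill):
    """Hypothetical solver for a deduction task: right with probability `skill`."""
    truth = run_program(source, "f", arg)
    return truth if rng.random() < skill else truth + 1


def accuracy_reward(prediction, expected):
    """Binary reward: 1.0 when the prediction matches the executed output."""
    return 1.0 if prediction == expected else 0.0


def learnability_reward(solve_rewards):
    """Assumed form: trivial or impossible tasks earn 0, otherwise 1 - mean solve rate."""
    avg = mean(solve_rewards)
    return 0.0 if avg in (0.0, 1.0) else 1.0 - avg


def self_play_step(rng, skill=0.5, n_rollouts=8):
    """One PROPOSE then SOLVE round; returns both rewards, or None for an invalid task."""
    source, arg = propose_task(rng)
    expected = validate_task(source, arg)
    if expected is None:
        return None
    rewards = [
        accuracy_reward(noisy_solver(source, arg, rng, skill), expected)
        for _ in range(n_rollouts)
    ]
    return {"propose_reward": learnability_reward(rewards), "solve_reward": mean(rewards)}


if __name__ == "__main__":
    print(self_play_step(random.Random(0)))
```

In a real training run both rewards would feed a reinforcement learning update of the same model; here they are only computed and returned.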

Our approach achieves strong performance across both code and math reasoning benchmarks without using any external data:

| Model | Base | #data | Code Avg | Math Avg | Total Avg |
|---|---|---|---|---|---|
| **Base Models** | | | | | |
| Qwen2.5-7B | - | - | 52.0 | 27.5 | 39.8 |
| Qwen2.5-7B-Ins | - | - | 56.3 | 37.0 | 46.7 |
| Qwen2.5-7B-Coder | - | - | 56.6 | 23.9 | 40.2 |
| **Reasoners Trained on Curated Code Data** | | | | | |
| AceCoder-RM | Ins | 22k | 58.3 | 37.4 | 47.9 |
| AceCoder-RM | Coder | 22k | 57.3 | 27.5 | 42.4 |
| AceCoder-Rule | Ins | 22k | 55.4 | 36.9 | 46.2 |
| AceCoder-Rule | Coder | 22k | 60.0 | 28.5 | 44.3 |
| CodeR1-LC2k | Ins | 2k | 60.5 | 35.6 | 48.0 |
| CodeR1-12k | Ins | 12k | 61.3 | 33.5 | 47.4 |
| **Reasoners Trained on Curated Math Data** | | | | | |
| PRIME-Zero | Coder | 484k | 37.2 | 45.8 | 41.5 |
| SimpleRL-Zoo | Base | 8.5k | 54.0 | 38.5 | 46.3 |
| Oat-Zero | Math | 8.5k | 45.4 | 44.3 | 44.9 |
| ORZ | Base | 57k | 55.6 | 41.6 | 48.6 |
| **Absolute Zero Training w/ No Curated Data (Ours)** | | | | | |
| AZR (Ours) | Base | 0 | 55.2 (+3.2) | 38.4 (+10.9) | 46.8 (+7.0) |
| AZR (Ours) | Coder | 0 | 61.6 (+5.0) | 39.1 (+15.2) | 50.4 (+10.2) |

AZR shows consistent improvements across model sizes and types:

| Model Family | Variant | Code Avg | Math Avg | Total Avg |
|---|---|---|---|---|
| Llama3.1-8b | | 28.5 | 3.4 | 16.0 |
| Llama3.1-8b | + AZR (Ours) | 31.6 (+3.1) | 6.8 (+3.4) | 19.2 (+3.2) |
| Qwen2.5-3B Coder | | 51.2 | 18.8 | 35.0 |
| Qwen2.5-3B Coder | + AZR (Ours) | 54.9 (+3.7) | 26.5 (+7.7) | 40.7 (+5.7) |
| Qwen2.5-7B Coder | | 56.6 | 23.9 | 40.2 |
| Qwen2.5-7B Coder | + AZR (Ours) | 61.6 (+5.0) | 39.1 (+15.2) | 50.4 (+10.2) |
| Qwen2.5-14B Coder | | 60.0 | 20.2 | 40.1 |
| Qwen2.5-14B Coder | + AZR (Ours) | 63.6 (+3.6) | 43.0 (+22.8) | 53.3 (+13.2) |
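
The gain shown next to each AZR score is the difference from the corresponding base model in the row above it. As a quick sanity check, the short sketch below recomputes those gains from the scores in the table; the `ROWS` dictionary and `gains` helper are illustrative names, not part of the repository.

```python
# Scores copied from the table above: (Code Avg, Math Avg, Total Avg)
# for the base model and for the same model after AZR training.
ROWS = {
    "Llama3.1-8b": ((28.5, 3.4, 16.0), (31.6, 6.8, 19.2)),
    "Qwen2.5-3B Coder": ((51.2, 18.8, 35.0), (54.9, 26.5, 40.7)),
    "Qwen2.5-7B Coder": ((56.6, 23.9, 40.2), (61.6, 39.1, 50.4)),
    "Qwen2.5-14B Coder": ((60.0, 20.2, 40.1), (63.6, 43.0, 53.3)),
}


def gains(base, trained):
    """Per-column improvement, rounded to one decimal like the table."""
    return tuple(round(t - b, 1) for b, t in zip(base, trained))


for family, (base, trained) in ROWS.items():
    code_gain, math_gain, total_gain = gains(base, trained)
    print(f"{family}: code +{code_gain}, math +{math_gain}, total +{total_gain}")
```

By this calculation the math gain grows with model size within the Qwen2.5 Coder family, while the code gain stays in a narrow band.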

    conda env create -f azr_env.yml
    conda activate azr
    pip install -r flashattn_requirements.txt

    python -m absolute_zero_reasoner.data_construction.process_code_reasoning_data

⚠️WARNING⚠️: The Python executor in this repository is very raw and intended for research purposes only. It is not secure for production environments. We plan to update our executor to more secure implementations in the future. Your use of our code i