github.com/KellerJordan/modded-nanogpt @main sqlite

2,918 symbols 8,641 edges 131 files 494 documented · 17%

README

Modded-NanoGPT

This repository hosts the NanoGPT speedrun, in which we (collaboratively|competitively) search for the fastest algorithm to use 8 NVIDIA H100 GPUs to train a language model that attains 3.28 cross-entropy loss on the FineWeb validation set.

(Note: Besides the main track, there is also an optimization track where we try to minimize steps subject to fixed arch/data/bsz and with unlimited wallclock budget.)

The target (3.28 validation loss on FineWeb) follows Andrej Karpathy's GPT-2 replication in llm.c, which attains that loss after running for 45 minutes. The speedrun code also descends from llm.c's PyTorch trainer, which itself descends from NanoGPT, hence the name of the repo. Thanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:

* Under 90 seconds on 8xH100 (the llm.c GPT-2 replication needed 45 minutes)
* Under 400M tokens (the llm.c GPT-2 replication needed 10B)
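As a quick sanity check on those two figures, the ratios can be worked out directly. This is only a sketch of the arithmetic: the constant and function names are illustrative, and because the README gives "under" bounds for the current record, the results are lower bounds on the improvement.

```python
# Figures quoted in the README. The record values are upper bounds
# ("under 90 seconds", "under 400M tokens"), so each ratio below is a
# lower bound on the actual improvement.
BASELINE_SECONDS = 45 * 60        # llm.c GPT-2 replication: 45 minutes
RECORD_SECONDS = 90               # current record: under 90 seconds
BASELINE_TOKENS = 10_000_000_000  # llm.c GPT-2 replication: 10B tokens
RECORD_TOKENS = 400_000_000       # current record: under 400M tokens


def improvement(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


wallclock_speedup = improvement(BASELINE_SECONDS, RECORD_SECONDS)
token_efficiency = improvement(BASELINE_TOKENS, RECORD_TOKENS)

print(f"wall-clock speedup: at least {wallclock_speedup:.0f}x")
print(f"token efficiency:   at least {token_efficiency:.0f}x")
```

In other words, the quoted numbers correspond to at least a 30x reduction in wall-clock time and at least a 25x reduction in training tokens relative to the llm.c baseline.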

This improvement in training speed has been brought about by the following techniques:

* Modernized architecture: Rotary embeddings, QK-Norm, and ReLU²
* The Muon optimizer [writeup] [repo]
* Use FP8 for head, and asymmetric rescale and softcap logits
* Use FP8 on MLP up projection forward pass
* Initialization of projections to zero (muP-like)
* Skip connections from embedding to every block as well as from block 3 to 6
* Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)
* Flash Attention 3 with long-short sliding window attention pattern (inspired by Gemma 2) and window size warmup with YaRN
* Align training batch starts with EoS and set a max document length
* Accumulate gradients for 2 steps for embedding and lm_head before updating parameters
* Single activation input for last 3 attention layers
* Polar Express implementation in Muon
* Smear module to enable 1 token look back
* Sparse attention gate
* NorMuon
* Cautious Weight Decay w/ schedule tied to LR
* Exponential decay of residual stream
* Batch size schedule
* Max seq length schedule
* Partial Key Offset
* Multi token prediction
* Untie embed and lm_head at 2/3 of training
* Additional gating on value embeddings and skip connection
* Paired head attention
* Bigram hash embedding on 1/4 of model_dim w/ sign trick
* MUDD skip connections to residual stream and attention values
* Learnable XSA

As well as many systems optimizations.
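Two of the simpler items in the list above, ReLU² and logit softcapping, can be illustrated on plain scalars. The sketch below is not the repo's code: the function names are hypothetical, the softcap uses the common symmetric `cap * tanh(x / cap)` form rather than the asymmetric rescale mentioned in the list, and the default cap of 15 is taken from record #18 in the history table, so the current training script may use something different.

```python
import math


def relu_squared(x: float) -> float:
    """ReLU² activation: clamp negatives to zero, then square."""
    return max(x, 0.0) ** 2


def tanh_softcap(logit: float, cap: float = 15.0) -> float:
    """Squash a logit smoothly into [-cap, cap] via cap * tanh(logit / cap).

    Small logits pass through almost unchanged; large ones saturate
    near +/- cap instead of growing without bound.
    """
    return cap * math.tanh(logit / cap)


for x in (-2.0, 0.5, 3.0):
    print(f"relu_squared({x}) = {relu_squared(x)}")

for logit in (0.1, 10.0, 1000.0):
    print(f"tanh_softcap({logit}) = {tanh_softcap(logit):.4f}")
```

In a real model these would be applied elementwise to tensors (the MLP hidden activations and the output logits, respectively); scalars are used here only to keep the example free of dependencies.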

Contributors list (growing with each new record): @bozavlado; @brendanh0gan; @fernbear.bsky.social; @Grad62304977; @jxbz; @kellerjordan0; @KoszarskyB; @leloykun; @YouJiacheng; @jadenj3o; @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; @ryanyang0, @vagrawal, @classiclarryd, @byronxu99, @varunneal, @EmelyanenkoK, @bernard24/https://www.hiverge.ai/, @Gusarich, @li_zichong, @akash5474, @snimu, @roeeshenberg, @ChrisJMcCormick, @dominikkallusky, @acutkosky, @manikbhandari, @andrewbriand, @jrauvola, @soren_dunn_, @photon_mz, @srashedll, @dhrvji, @EmmettBicker, @dualverse-ai, @sisovicm, @moof2x, @samacqua, @Lisennlp, @_djdumpling, @TrianX

Running the current record

To run the current record, run the following commands.

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install -r requirements.txt
# downloads only the first 900M training tokens to save time
python data/cached_fineweb10B.py 9
./run.sh
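The comment in the commands above pairs the argument `9` with "the first 900M training tokens", which suggests roughly 100M tokens per downloaded chunk. Under that assumption (it is inferred from the comment, not confirmed from the script), the chunk count for a given token budget could be estimated as follows; `TOKENS_PER_CHUNK` and `chunks_needed` are hypothetical names.

```python
# Assumption: each chunk holds about 100M tokens, inferred from the
# README comment pairing the argument 9 with "900M training tokens".
TOKENS_PER_CHUNK = 100_000_000


def chunks_needed(token_budget: int, tokens_per_chunk: int = TOKENS_PER_CHUNK) -> int:
    """Return the smallest number of chunks covering `token_budget` tokens."""
    if token_budget <= 0:
        return 0
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-token_budget // tokens_per_chunk)


print(chunks_needed(900_000_000))  # chunk count for a 900M-token budget
print(chunks_needed(850_000_000))  # partial chunks round up
```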

If ./run.sh fails with the error "torchrun: command not found", add torchrun to your PATH.
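One way to confirm whether torchrun is visible before launching is to look it up on PATH from Python. This is a minimal sketch using only the standard library; `find_executable` is a hypothetical helper name, and where pip placed the torchrun script depends on your environment.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


torchrun_path = find_executable("torchrun")
if torchrun_path is None:
    # Typically the script lives in the bin/ (or Scripts/) directory of
    # the Python environment where torch was installed.
    print("torchrun not found on PATH; add your environment's script directory to PATH")
else:
    print(f"torchrun found at {torchrun_path}")
```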

Note: torch.compile will add around 7 minutes of latency the first time you run the code.

Official records are timed on 8 NVIDIA H100 GPUs from https://app.primeintellect.ai/. PrimeIntellect has generously sponsored recent validation runs.

Alternative: Running with Docker (recommended for precise timing)

If the CUDA or NCCL versions on your system are incompatible, Docker can be a helpful alternative. This approach standardizes the versions of CUDA, NCCL, cuDNN, and Python, reducing dependency issues and simplifying setup. Note: an NVIDIA driver must already be installed on the system, so this approach is useful when only the NVIDIA driver and Docker are available.

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
sudo docker build -t modded-nanogpt .
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt python data/cached_fineweb10B.py 8
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh

To get an interactive shell inside the Docker container, use:

sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt bash

World record history

The table below shows the historical progression of world speed records for this competitive task:

Train a neural network to ≤3.28 validation loss on FineWeb using 8x NVIDIA H100s.

Note: The 3.28 target was selected to match Andrej Karpathy's GPT-2 (small) reproduction.

#	Record time	Description	Date	Log	Contributors
1	45 minutes	llm.c baseline	05/28/24	log	@karpathy, llm.c contributors
2	31.4 minutes	Tuned learning rate & rotary embeddings	06/06/24	log	@kellerjordan0
3	24.9 minutes	Introduced the Muon optimizer	10/04/24	none	@kellerjordan0, @jxbz
4	22.3 minutes	Muon improvements	10/11/24	log	@kellerjordan0, @bozavlado
5	15.2 minutes	Pad embeddings, ReLU², zero-init projections, QK-norm	10/14/24	log	@Grad62304977, @kellerjordan0
6	13.1 minutes	Distributed the overhead of Muon	10/18/24	log	@kellerjordan0
7	12.0 minutes	Upgraded to PyTorch 2.5.0	10/18/24	log	@kellerjordan0
8	10.8 minutes	Untied embedding and head	11/03/24	log	@Grad62304977, @kellerjordan0
9	8.2 minutes	Value and embedding skip connections, momentum warmup, logit softcap	11/06/24	log	@Grad62304977, @kellerjordan0
10	7.8 minutes	Bfloat16 activations	11/08/24	log	@kellerjordan0
11	7.2 minutes	U-net pattern skip connections & double lr	11/10/24	log	@brendanh0gan
12	5.03 minutes	1024-ctx dense causal attention → 64K-ctx FlexAttention	11/19/24	log	@KoszarskyB
13	4.66 minutes	Attention window warmup	11/24/24	log	@fernbear.bsky.social
14	4.41 minutes	Value Embeddings	12/04/24	log	@KoszarskyB
15	3.95 minutes	U-net pattern value embeddings, assorted code optimizations	12/08/24	log	@leloykun, @YouJiacheng
16	3.80 minutes	Split value embeddings, block sliding window, separate block mask	12/10/24	log	@YouJiacheng
17	3.57 minutes	Sparsify value embeddings, improve rotary embeddings, drop an attn layer	12/17/24	log	@YouJiacheng
18	3.4 minutes	Lower logit softcap from 30 to 15	01/04/25	log	@KoszarskyB
19	3.142 minutes	FP8 head, offset logits, lr decay to 0.1 instead of 0.0	01/13/25	log	@YouJiacheng
20	2.992 minutes	Merged QKV weights, long-short attention, attention scale, lower Adam epsilon, batched Muon	01/16/25	log	@leloykun, @fernbear.bsky.social, @YouJiacheng, @brendanh0gan, @scottjmaddox, @Grad62304977
21	2.933 minutes	Reduced batch size	01/26/25	log	@leloykun
21	2.997 minutes	21st record with new timing	02/01/25	log	not a new record, just re-timing #21 with the updated rules
21	3.014 minutes	21st record with latest torch	05/24/25	log	not a new record, just re-timing #21 with latest torch
22	2.990 minutes	Faster gradient all-reduce	05/24/25	log	@KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; The Enigma project
23	2.979 minutes	Overlap computation and gradient communication	05/25/25	log	@ryanyang0
24	2.966 minutes	Replace gradient all_reduce with reduce_scatter	05/30/25	log	@vagrawal
25	2.896 minutes	Upgrade PyTorch to 2.9.0.dev20250713+cu126	07/13/25	log	@kellerjordan0
26	2.863 minutes	Align training batch starts with EoS, increase cooldown frac to .45	07/13/25	log	@classiclarryd
27	2.817 minutes	Transpose one of the MLP matrices + add Triton kernel for symmetric matmul	07/18/25	log,PR	@byronxu99
28	2.812 minutes	Sparse attention gate	08/23/25	log,PR	@classiclarryd
29	2.731 minutes	Flash Attention 3, 2048 max_doc_len, update ws schedule	09/03/25	[log](records/track_1_short/2025-09-03_FA3/44fc1276-0510-4961-9

Core symbols most depended-on inside this repo

Symbol	Called by	File
get	452	records/track_1_short/2025-12-11_NorMuonOptimsAndFixes/profiler-example-traces/train_gpt-profiler-example.py
print0	157	records/track_3_optimization/results/20260520_tail_refinterp_2900/train_gpt_tail_refinterp_2900.py
print0	156	records/track_3_optimization/results/20260529_tail_phase_readout_2850/train_gpt_tail_phase_readout_2850.py
numel	94	records/track_3_optimization/results/20260513_shampoo_1_4_power/distributed_shampoo/preconditioner/preconditioner_list.py
print0	84	records/track_3_optimization/results/20260611_tailema_2720_submission/train_gpt_tailema_2720.py
print0	84	records/track_3_optimization/results/20260611_tailema_2730_submission/train_gpt_tailema_2730.py
state_dict	72	records/track_3_optimization/results/20260513_shampoo_1_4_power/distributed_shampoo/utils/optimizer_modules.py
print0	71	records/track_3_optimization/results/20260611_tailema_2730_submission/ablation/combo_2740/train_gpt_combo_2740.py

Shape

Method 1,549

Function 771

Class 557

Route 41

Languages

Python 100%

Modules by API surface

records/track_3_optimization/results/20260529_tail_phase_readout_2850/train_gpt_tail_phase_readout_2850.py · 103 symbols

records/track_3_optimization/results/20260520_tail_refinterp_2900/train_gpt_tail_refinterp_2900.py · 98 symbols

train_gpt_medium.py · 87 symbols

records/track_3_optimization/results/20260520_rre_extrapolation_pr300_2925/train_gpt_simple_rre_pr300_2925.py · 87 symbols

train_gpt.py · 86 symbols

records/track_1_short/2025-12-11_NorMuonOptimsAndFixes/profiler-example-traces/train_gpt-profiler-example.py · 74 symbols

records/track_3_optimization/results/20260611_tailema_2730_submission/train_gpt_tailema_2730.py · 72 symbols

records/track_3_optimization/results/20260611_tailema_2720_submission/train_gpt_tailema_2720.py · 72 symbols

records/track_3_optimization/results/20260611_tailema_2730_submission/ablation/combo_2740/train_gpt_combo_2740.py · 71 symbols

records/track_3_optimization/results/20260513_shampoo_1_4_power/distributed_shampoo/tests/distributed_shampoo_test.py · 70 symbols

records/track_3_optimization/results/20260513_shampoo_1_4_power/distributed_shampoo/preconditioner/tests/matrix_functions_test.py · 67 symbols

records/track_3_optimization/results/20260525_aurora_ema_ref/train_gpt_simple_aurora_ema_ref.py · 65 symbols

Dependencies from manifests, versioned

torch 2.10 · 1×

typing-extensions 4.15.0 · 1×

For agents

$ claude mcp add modded-nanogpt \
  -- python -m otcore.mcp_server <graph>