Train LLM From Scratch

I am looking for a PhD position in AI.

I implemented a transformer model from scratch using PyTorch, based on the paper Attention is All You Need. You can use my scripts to train your own billion or million parameter LLM using a single GPU.

This started as a pretraining tutorial. It now goes all the way from raw text to an aligned, reasoning-style model, with every algorithm handwritten in plain PyTorch (no trl, no peft, no transformers). The whole journey is one idea repeated: turn text into numbers, predict the next token, then keep changing the data and the loss until the model does what we want.

From raw text to an aligned reasoning model

Here is the path we will walk, end to end:

raw text  ->  tokens  ->  a Transformer  ->  next-token loss  ->  a base model
base model  ->  SFT  ->  Reward Model  ->  {PPO, DPO}  ->  GRPO  ->  evaluation and chat

Below is the output of a trained 13 million parameter LLM, just so you can see where the small end of this starts:

In ***1978, The park was returned to the factory-plate that
the public share to the lower of the electronic fence that
follow from the Station's cities. The Canal of ancient Western
nations were confined to the city spot. The villages were directly
linked to cities in China that revolt that the US budget and in
Odambinais is uncertain and fortune established in rural areas.

Who this is for

I tried to write this so one page works for very different readers:

  • If you are a student, read top to bottom. Every block of code comes after a plain explanation of what it does and why, and most blocks are followed by the output you should expect.
  • If you are a developer, the commands and file paths are all here. You can copy, run, and read the referenced source files directly.
  • If you are a researcher, the post-training half is the interesting part: SFT, a Bradley-Terry reward model, PPO with GAE, DPO/ORPO/KTO, and GRPO, all from scratch on the same small Transformer, trained on real public datasets.

Every diagram in this README is colored the same way, so the colors mean something:

  • green is raw data
  • teal is stored, tokenized data on disk
  • blue is a plain processing step
  • yellow is the model or a training step
  • orange is the reinforcement learning and reward parts
  • red is a loss
  • grey is a saved checkpoint
  • purple is the final output or evaluation

Prerequisites and Training Time

You need a basic understanding of object-oriented programming, neural networks, and PyTorch. Below are some resources to help you get started:

| Topic | Video Link |
|---|---|
| OOP | OOP Video |
| Neural Network | Neural Network Video |
| PyTorch | PyTorch Video |

You will need a GPU to train. A free Colab or Kaggle T4 is enough for the 13 million parameter model, but it will not fit a billion parameter model. Here is a rough guide:

| GPU Name | Memory | 2B LLM Training | 13M LLM Training | Max Practical LLM Size (Training) |
|---|---|---|---|---|
| NVIDIA A100 | 40 GB | | | ~6B to 8B |
| NVIDIA V100 | 16 GB | | | ~2B |
| NVIDIA RTX 4090 | 24 GB | | | ~4B |
| NVIDIA RTX 5090 | 32 GB | | | 13M verified, larger configs TBD |
| NVIDIA RTX 3090 | 24 GB | | | ~3.5B to 4B |
| NVIDIA RTX 4080 | 16 GB | | | ~2B |
| NVIDIA RTX 4060 | 8 GB | | | ~1B |
| Tesla T4 | 16 GB | | | ~1.5B to 2B |

If a large config runs out of memory, the pretraining script has opt-in flags (--amp, --grad-checkpointing, --grad-accum) that bring the memory down a lot. More on those later.

Setup

Clone the repository and install it in editable mode. The editable install puts config, src, data_loader, and ui on your import path, so you do not need to set PYTHONPATH by hand anymore:

git clone https://github.com/FareedKhan-dev/train-llm-from-scratch.git
cd train-llm-from-scratch
pip install -e .

There are optional extras for the parts you want:

pip install -e ".[train]"   # datasets + wandb, for downloading data and logging
pip install -e ".[ui]"      # streamlit + pandas + altair, for the control panel
pip install -e ".[docs]"    # mkdocs, for the documentation site
pip install -e ".[all]"     # everything

There are two config systems, and it helps to know which is which from the start:

  • config/config.py is the original, simple config for the legacy pretraining script scripts/train_transformer.py. It is plain Python constants.
  • config/post_training_config.py plus the JSON files in configs/ drive everything else (pretraining the bigger base, SFT, reward, DPO, PPO, GRPO). You edit a small JSON file per stage, and any field can also be overridden on the command line, for example --lr 2e-5 --batch_size 16.
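To make the layering concrete, here is a minimal sketch of how such an override chain could behave, using plain dictionaries. The function name `merge_layers` and the sample keys are hypothetical and only for illustration; the repository's real merging logic lives in config/loader.py and may differ in its details.

```python
def merge_layers(defaults, *layers):
    """Merge config layers left to right; later layers win.

    Hypothetical helper: it rejects keys that are not in the defaults so a
    typo in a JSON file or on the command line fails loudly.
    """
    merged = dict(defaults)
    for layer in layers:
        unknown = set(layer) - set(defaults)
        if unknown:
            raise KeyError(f"unknown config fields: {sorted(unknown)}")
        merged.update(layer)
    return merged


# Illustrative values only, not the repository's real defaults.
defaults = {"lr": 3e-4, "batch_size": 32, "epochs": 1}
base_json = {"batch_size": 64}            # shared settings
stage_json = {"lr": 1e-5}                 # per-stage settings
cli = {"lr": 2e-5, "batch_size": 16}      # e.g. --lr 2e-5 --batch_size 16

cfg = merge_layers(defaults, base_json, stage_json, cli)
print(cfg)
```

The command line sits last in the chain, so a one-off experiment never requires editing a JSON file.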

For fast checks there is a tiny configs/smoke/ variant of every stage that shrinks the model so a full run finishes in seconds on a CPU or a single GPU.

Code Structure

train-llm-from-scratch/
├── src/
│   ├── models/                  # the Transformer, built from small pieces
│   │   ├── mlp.py               # the feed-forward block
│   │   ├── attention.py         # single head and multi head attention
│   │   ├── transformer_block.py # one block: attention + MLP + residuals
│   │   └── transformer.py       # the full model: embeddings + blocks + lm_head
│   └── post_training/           # SFT, reward model, PPO, DPO, GRPO, eval, inference
├── config/
│   ├── config.py                # legacy pretraining config (plain constants)
│   ├── post_training_config.py  # dataclasses for every post-training stage
│   └── loader.py                # merges defaults < base.json < stage.json < CLI
├── configs/                     # editable JSON, one file per stage (+ smoke/)
├── data_loader/                 # batch iterators for each kind of data
├── scripts/                     # every runnable step lives here
├── ui/                          # the Streamlit control panel
├── docs/                        # the MkDocs site (theory + diagrams)
├── images/                      # the diagrams in this README (+ the generator)
└── pyproject.toml               # pip install -e .

Step 1: Preparing the Data

A model only ever sees integers. So the first job is always the same: take text, turn it into token ids, and store those ids on disk in a format that is fast to read during training. We do this four times, once for each kind of training we will do later.

The data pipeline

The four streams are:

  1. Pretraining text from The Pile, stored as a flat array of token ids in an HDF5 file.
  2. Instruction data (Alpaca, Dolly, GSM8K) for SFT, packed into fixed length rows with a mask that says which tokens are the assistant's answer.
  3. Preference pairs (Anthropic HH-RLHF, UltraFeedback) for the reward model and DPO, stored as {prompt, chosen, rejected}.
  4. RL prompts (GSM8K and a small arithmetic warm-up) for PPO and GRPO, stored as {prompt, gold}.
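The "packed into fixed length rows with a mask" idea in stream 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the repository's packing code: the helper name `pack_rows` is hypothetical, and padding the final row with the end-of-text id (50256) and a mask of 0 is my assumption.

```python
EOT_ID = 50256  # r50k_base <|endoftext|>; reused as padding here (an assumption)


def pack_rows(examples, row_len):
    """Concatenate (ids, mask) examples and cut the stream into fixed-length rows."""
    flat_ids, flat_mask = [], []
    for ids, mask in examples:
        if len(ids) != len(mask):
            raise ValueError("ids and mask must have the same length")
        flat_ids.extend(ids)
        flat_mask.extend(mask)

    # Pad the tail so the stream divides evenly; padding is never trained on.
    remainder = len(flat_ids) % row_len
    if remainder:
        pad = row_len - remainder
        flat_ids.extend([EOT_ID] * pad)
        flat_mask.extend([0] * pad)

    rows = [flat_ids[i:i + row_len] for i in range(0, len(flat_ids), row_len)]
    masks = [flat_mask[i:i + row_len] for i in range(0, len(flat_mask), row_len)]
    return rows, masks


# Two tiny fake examples: mask is 1 on the "assistant" tokens only.
examples = [([11, 12, 13], [0, 1, 1]), ([21, 22], [0, 1])]
rows, masks = pack_rows(examples, row_len=4)
print(rows)
print(masks)
```

Packing keeps every row the same length, so a batch is a plain rectangular array and no compute is wasted on per-example padding.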

Tokenization

We use the r50k_base tokenizer from OpenAI's tiktoken, the same one GPT-3 used. Text becomes a list of integers, and we append a special <|endoftext|> token (id 50256) at the end of every document so the model learns where one piece of text stops and the next begins.
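As a small sketch of that step, the function below encodes a list of documents into one flat list and appends the end-of-text id after each one. To keep the example runnable without downloads it uses a toy byte-level encoder; with tiktoken installed you would pass `tiktoken.get_encoding("r50k_base").encode_ordinary` instead. The names `tokenize_documents` and `toy_encode` are hypothetical.

```python
EOT_ID = 50256  # id of <|endoftext|> in r50k_base


def tokenize_documents(docs, encode):
    """Encode each document and append EOT so document boundaries are visible."""
    flat = []
    for doc in docs:
        flat.extend(encode(doc))
        flat.append(EOT_ID)
    return flat


def toy_encode(text):
    # Stand-in for a real BPE tokenizer: one id per UTF-8 byte.
    return list(text.encode("utf-8"))


tokens = tokenize_documents(["hi", "yo"], toy_encode)
print(tokens)
```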

For the legacy path, download a slice of The Pile and tokenize it into HDF5:

python scripts/data_download.py            # downloads the validation file + 1 training shard
python scripts/data_preprocess.py          # tokenizes to data/train/pile_train.h5 and data/val/pile_dev.h5

The newer, faster path streams and batch-encodes the same data straight into a flat token array:

python scripts/prepare_pretrain_data.py --split val   --out data/pile_dev.h5
python scripts/prepare_pretrain_data.py --split train --num_shards 1 --out data/pile_train.h5

Once tokenized, the data is just a long line of integers. Here is a real peek at the validation file I prepared for this README (8.76 million tokens), the first ten ids, and what they decode back to:

#### OUTPUT ####
dtype: int32 | shape: (8762951,) | total tokens: 8762951
first 10 token ids: [18610, 286, 3993, 3081, 319, 4088, 11, 4640, 2163, 11]
decoded back to text:
'Effect of sleep quality on memory, executive function, and language
 performance in patients with refractory focal epilepsy ...'

That is the whole idea of tokenization in one output: text in, a flat array of integers out, and the integers decode straight back to the original words.
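A flat array like this is convenient because a training example is just a window into it. The sketch below shows the usual next-token setup, assuming a random start position and a target that is the input shifted by one token; the helper name `sample_window` is hypothetical and the repository's data loaders may sample differently.

```python
import random


def sample_window(tokens, context_len, rng):
    """Return (inputs, targets) where targets are inputs shifted left by one."""
    if len(tokens) <= context_len:
        raise ValueError("need more tokens than the context length")
    start = rng.randrange(0, len(tokens) - context_len)
    x = tokens[start:start + context_len]
    y = tokens[start + 1:start + context_len + 1]
    return x, y


# The first ten ids from the validation file shown above.
tokens = [18610, 286, 3993, 3081, 319, 4088, 11, 4640, 2163, 11]
x, y = sample_window(tokens, context_len=4, rng=random.Random(0))
print(x)
print(y)
```

Every position in `x` has a label in `y`, so one window of length 4 yields four next-token predictions.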

The chat format and loss mask

For everything after pretraining the model has to know who is talking. The r50k_base tokenizer has only one special token, so instead of inventing new ones we use plain text role markers that the model simply learns during SFT. A single turn looks like this (see src/post_training/chat_template.py):

<|user|>
{user content}<|endoftext|><|assistant|>
{assistant content}<|endoftext|>

For math and reasoning we ask the assistant to show its work in a fixed structure, because the reinforcement learning reward later checks the number inside the answer tags:

<think>step by step reasoning ...</think><answer>42</answer>

The important trick is the loss mask. When we encode a conversation we also build a 0/1 mask that is 1 only on the assistant tokens (and the <|endoftext|> that ends the turn). That way SFT trains the model to write answers, not to parrot the prompt back. Here is the exact code that builds the ids and the aligned mask:

def encode_chat(messages, add_generation_prompt=False):
    ids, mask = [], []
    for m in messages:
        role = m["role"]
        # Role header is always masked out (we never train the model to emit it).
        header_ids = _encode_ordinary(_header_for(role))
        ids.extend(header_ids)
        mask.extend([0] * len(header_ids))

        content_ids = _encode_ordinary(m["content"])
        is_completion = role == "assistant"
        ids.extend(content_ids)
        mask.extend([1 if is_completion else 0] * len(content_ids))   # train on assistant only

        ids.append(EOT_ID)                                            # turn terminator
        mask.append(1 if is_completion else 0)                        # learn to stop
    return ids, mask
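To show what the mask is for, here is a minimal sketch of a masked average over per-token losses. It assumes the usual next-token alignment (the prediction at position t is scored against token t+1, so the mask is shifted by one) and uses made-up loss values; the local `masked_mean` is a plain-Python illustration, not the repository's helper of the same name.

```python
def masked_mean(values, mask):
    """Average only the entries where mask is 1; return 0.0 if nothing is selected."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Mask over 5 tokens: two prompt tokens, a masked role header, two answer tokens.
loss_mask = [0, 0, 0, 1, 1]

# Made-up per-token losses for predicting tokens 1..4 from positions 0..3.
per_token_loss = [2.0, 1.5, 0.5, 0.25]

# Position t predicts token t+1, so drop the first mask entry to align.
target_mask = loss_mask[1:]

sft_loss = masked_mean(per_token_loss, target_mask)
print(sft_loss)  # (0.5 + 0.25) / 2 = 0.375
```

The prompt tokens still pass through the model as context; they simply contribute nothing to the gradient.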

Here is a real rendered conversation and the verifier reward in action, printed from this repo:

#### OUTPUT ####
rendered chat:
<|user|>
What is 13 + 29?<|endoftext|><|assistant|>
<think>13 + 29 = 42</think><answer>42</answer><|endoftext|>

extract_answer("<answer>42</answer>")        -> 42.0
reward_gsm8k("<answer>42</answer>", 42.0)    -> 1.2    # correct AND well formatted
reward_gsm8k("<answer>7</answer>",  42.0)    -> 0.2    # wrong, but it used the format
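A verifier reward of this shape can be sketched with a regular expression. The version below is an illustration only: the function names are hypothetical, the 0.2 format bonus plus 1.0 correctness bonus is inferred from the printed values 1.2 and 0.2, and returning 0.0 when no answer tag is found is my assumption.

```python
import re

_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_answer(text):
    """Return the number inside the first <answer> tag, or None if absent or invalid."""
    match = _ANSWER_RE.search(text)
    if match is None:
        return None
    try:
        return float(match.group(1).strip().replace(",", ""))
    except ValueError:
        return None


def format_plus_correct_reward(completion, gold, format_bonus=0.2,
                               correct_bonus=1.0, tol=1e-6):
    """Small reward for using the format, larger reward for the right number."""
    value = parse_answer(completion)
    if value is None:
        return 0.0  # assumption: no parsable answer means no reward
    reward = format_bonus
    if abs(value - gold) <= tol:
        reward += correct_bonus
    return reward


print(parse_answer("<answer>42</answer>"))
print(format_plus_correct_reward("<answer>42</answer>", 42.0))
print(format_plus_correct_reward("<answer>7</answer>", 42.0))
```

Rewarding the format separately gives the policy a learning signal early in RL, before it gets many answers right.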

And here is one real packed SFT row, showing how only the assistant tokens are trained (the mask is 1 on 48 of the 512 tokens in this row):

#### OUTPUT ####
tokens shape: (2131, 512) | loss_mask shape: (2131, 512)
