github.com/FareedKhan-dev/train-llm-from-scratch @main sqlite

270 symbols 1,341 edges 78 files 144 documented · 53%

README

Train LLM From Scratch

I am looking for a PhD position in AI.

I implemented a transformer model from scratch using PyTorch, based on the paper Attention Is All You Need. You can use my scripts to train your own million or billion parameter LLM on a single GPU.

This started as a pretraining tutorial. It now goes all the way from raw text to an aligned, reasoning style model, with every algorithm hand written in plain PyTorch (no trl, no peft, no transformers). The whole journey is one idea repeated: turn text into numbers, predict the next token, then keep changing the data and the loss until the model does what we want.

From raw text to an aligned reasoning model

Here is the path we will walk, end to end:

raw text  ->  tokens  ->  a Transformer  ->  next-token loss  ->  a base model
base model  ->  SFT  ->  Reward Model  ->  {PPO, DPO}  ->  GRPO  ->  evaluation and chat

Below is the output of a trained 13 million parameter LLM, just so you can see where the small end of this starts:

In ***1978, The park was returned to the factory-plate that
the public share to the lower of the electronic fence that
follow from the Station's cities. The Canal of ancient Western
nations were confined to the city spot. The villages were directly
linked to cities in China that revolt that the US budget and in
Odambinais is uncertain and fortune established in rural areas.

Who this is for
Prerequisites and Training Time
Setup
Code Structure
Step 1: Preparing the Data
Step 2: The Model, Built From Small Pieces
Multi Layer Perceptron (MLP)
Single Head Attention
Multi Head Attention
The Transformer Block
The Full Transformer
Step 3: Pretraining the Base Model
Step 4: Generating Text
Step 5: Post-Training, Turning a Base Model Into an Assistant
SFT (Supervised Fine-Tuning)
The Reward Model
DPO, ORPO and KTO
PPO
GRPO / RLVR
Step 6: Evaluation
Step 7: Talking to the Model
The Streamlit Control Panel
The Documentation Site
Run the Whole Thing
What's Next

Who this is for

I tried to write this so one page works for very different readers:

If you are a student, read top to bottom. Every block of code comes after a plain explanation of what it does and why, and most blocks are followed by the output you should expect.
If you are a developer, the commands and file paths are all here. You can copy, run, and read the referenced source files directly.
If you are a researcher, the post-training half is the interesting part: SFT, a Bradley-Terry reward model, PPO with GAE, DPO/ORPO/KTO, and GRPO, all from scratch on the same small Transformer, trained on real public datasets.

Every diagram in this README is colored the same way, so the colors mean something:

green is raw data
teal is stored, tokenized data on disk
blue is a plain processing step
yellow is the model or a training step
orange is the reinforcement learning and reward parts
red is a loss
grey is a saved checkpoint
purple is the final output or evaluation

Prerequisites and Training Time

You need a basic understanding of object oriented programming, neural networks, and PyTorch. Below are some resources to help you get started:

Topic	Video Link
OOP	OOP Video
Neural Network	Neural Network Video
PyTorch	PyTorch Video

You will need a GPU to train. A free Colab or Kaggle T4 is enough for the 13 million parameter model, but it will not fit a billion parameter model. Here is a rough guide:

GPU Name	Memory	2B LLM Training	13M LLM Training	Max Practical LLM Size (Training)
NVIDIA A100	40 GB	✔	✔	~6B to 8B
NVIDIA V100	16 GB	✘	✔	~2B
NVIDIA RTX 4090	24 GB	✔	✔	~4B
NVIDIA RTX 5090	32 GB	✔	✔	13M verified, larger configs TBD
NVIDIA RTX 3090	24 GB	✔	✔	~3.5B to 4B
NVIDIA RTX 4080	16 GB	✘	✔	~2B
NVIDIA RTX 4060	8 GB	✘	✔	~1B
Tesla T4	16 GB	✘	✔	~1.5B to 2B

If a large config runs out of memory, the pretraining script has opt-in flags (--amp, --grad-checkpointing, --grad-accum) that bring the memory down a lot. More on those later.
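
As a rough cross-check on memory, here is a minimal sketch of the usual back-of-envelope estimate. It assumes plain fp32 training with an Adam-style optimizer (4 bytes each for weights and gradients, 8 bytes for the two optimizer moments) and ignores activations, so it is only a lower bound. The function name and defaults are illustrative and not taken from the repo, and the table above depends on settings this sketch does not model.

```python
def min_training_memory_gib(n_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Lower bound on training memory in GiB, ignoring activations.

    The defaults assume fp32 weights and gradients plus two fp32 Adam moments,
    i.e. 16 bytes per parameter. These are assumptions, not repo settings.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * bytes_per_param / 2**30


for name, n in [("13M", 13_000_000), ("1B", 1_000_000_000), ("2B", 2_000_000_000)]:
    print(f"{name}: at least {min_training_memory_gib(n):.2f} GiB before activations")
```

Activation memory comes on top of this and grows with batch size and sequence length, which is the part that mixed precision, gradient checkpointing, and gradient accumulation are meant to shrink.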

Setup

Clone the repository and install it in editable mode. The editable install puts config, src, data_loader, and ui on your import path, so you do not need to set PYTHONPATH by hand anymore:

git clone https://github.com/FareedKhan-dev/train-llm-from-scratch.git
cd train-llm-from-scratch
pip install -e .

There are optional extras for the parts you want:

pip install -e ".[train]"   # datasets + wandb, for downloading data and logging
pip install -e ".[ui]"      # streamlit + pandas + altair, for the control panel
pip install -e ".[docs]"    # mkdocs, for the documentation site
pip install -e ".[all]"     # everything

There are two config systems, and it helps to know which is which from the start:

config/config.py is the original, simple config for the legacy pretraining script scripts/train_transformer.py. It is plain Python constants.
config/post_training_config.py plus the JSON files in configs/ drive everything else (pretraining the bigger base, SFT, reward, DPO, PPO, GRPO). You edit a small JSON file per stage, and any field can also be overridden on the command line, for example --lr 2e-5 --batch_size 16.

For fast checks there is a tiny configs/smoke/ variant of every stage that shrinks the model so a full run finishes in seconds on a CPU or a single GPU.
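
To make the layering concrete, here is a minimal sketch of how defaults, JSON files, and command-line flags could be merged, with later layers winning. The `StageConfig` fields and the `load_config` helper are hypothetical stand-ins; the real dataclasses and loader live in config/post_training_config.py and config/loader.py and may work differently.

```python
import argparse
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class StageConfig:
    # Hypothetical fields for illustration only.
    lr: float = 1e-4
    batch_size: int = 8
    max_steps: int = 1000


def load_config(json_texts, argv):
    """Merge dataclass defaults, then each JSON layer in order, then CLI flags."""
    values = asdict(StageConfig())
    known = {f.name for f in fields(StageConfig)}

    for text in json_texts:  # e.g. the base file first, then the stage file
        layer = json.loads(text)
        unknown = set(layer) - known
        if unknown:
            raise KeyError(f"unknown config fields: {sorted(unknown)}")
        values.update(layer)

    parser = argparse.ArgumentParser()
    for f in fields(StageConfig):
        # default=None lets us tell "flag not given" apart from a real value
        parser.add_argument(f"--{f.name}", type=f.type, default=None)
    cli = vars(parser.parse_args(argv))
    values.update({k: v for k, v in cli.items() if v is not None})
    return StageConfig(**values)


base = '{"batch_size": 32}'
stage = '{"lr": 3e-5, "max_steps": 200}'
cfg = load_config([base, stage], ["--lr", "2e-5", "--batch_size", "16"])
print(cfg)
```

In this sketch the command line overrides `lr` and `batch_size`, while `max_steps` keeps the value from the stage JSON because no flag was given for it.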

Code Structure

train-llm-from-scratch/
├── src/
│   ├── models/                  # the Transformer, built from small pieces
│   │   ├── mlp.py               # the feed-forward block
│   │   ├── attention.py         # single head and multi head attention
│   │   ├── transformer_block.py # one block: attention + MLP + residuals
│   │   └── transformer.py       # the full model: embeddings + blocks + lm_head
│   └── post_training/           # SFT, reward model, PPO, DPO, GRPO, eval, inference
├── config/
│   ├── config.py                # legacy pretraining config (plain constants)
│   ├── post_training_config.py  # dataclasses for every post-training stage
│   └── loader.py                # merges defaults < base.json < stage.json < CLI
├── configs/                     # editable JSON, one file per stage (+ smoke/)
├── data_loader/                 # batch iterators for each kind of data
├── scripts/                     # every runnable step lives here
├── ui/                          # the Streamlit control panel
├── docs/                        # the MkDocs site (theory + diagrams)
├── images/                      # the diagrams in this README (+ the generator)
└── pyproject.toml               # pip install -e .

Step 1: Preparing the Data

A model only ever sees integers. So the first job is always the same: take text, turn it into token ids, and store those ids on disk in a format that is fast to read during training. We do this four times, once for each kind of training we will do later.

The data pipeline

The four streams are:

Pretraining text from The Pile, stored as a flat array of token ids in an HDF5 file.
Instruction data (Alpaca, Dolly, GSM8K) for SFT, packed into fixed length rows with a mask that says which tokens are the assistant's answer.
Preference pairs (Anthropic HH-RLHF, UltraFeedback) for the reward model and DPO, stored as {prompt, chosen, rejected}.
RL prompts (GSM8K and a small arithmetic warm-up) for PPO and GRPO, stored as {prompt, gold}.
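
The "packed into fixed length rows with a mask" idea from the instruction stream can be sketched in a few lines. This is a simple greedy scheme written for illustration: it concatenates examples, cuts the result every `row_len` tokens, and pads the tail with mask 0. The helper name, the toy ids, and the choice of padding id are assumptions, and the repo's own packing code may differ.

```python
def pack_rows(examples, row_len, pad_id):
    """Concatenate (ids, mask) examples and cut them into fixed-length rows.

    The final partial row is padded with pad_id and mask 0, so padding never
    contributes to the loss.
    """
    flat_ids, flat_mask = [], []
    for ids, mask in examples:
        if len(ids) != len(mask):
            raise ValueError("ids and mask must be aligned")
        flat_ids.extend(ids)
        flat_mask.extend(mask)

    rows = []
    for start in range(0, len(flat_ids), row_len):
        ids = flat_ids[start:start + row_len]
        mask = flat_mask[start:start + row_len]
        pad = row_len - len(ids)
        rows.append((ids + [pad_id] * pad, mask + [0] * pad))
    return rows


examples = [
    ([11, 12, 13, 14, 50256], [0, 0, 1, 1, 1]),  # 2 prompt tokens, 3 trained tokens
    ([21, 22, 23, 50256], [0, 1, 1, 1]),         # 1 prompt token, 3 trained tokens
]
for ids, mask in pack_rows(examples, row_len=6, pad_id=50256):
    print(ids, mask)
```

Note that this greedy cut can split one conversation across two rows. A real pipeline might instead start a new row when the next example does not fit.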

Tokenization

We use the r50k_base tokenizer from OpenAI's tiktoken, the same one GPT-3 used. Text becomes a list of integers, and we append a special <|endoftext|> token (id 50256) at the end of every document so the model learns where one piece of text stops and the next begins.
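
Here is a minimal sketch of the "append `<|endoftext|>` after every document" step. To keep it runnable without downloads it uses a stand-in byte-level encoder, not r50k_base, and the helper names are hypothetical; only the id 50256 comes from the text above.

```python
EOT_ID = 50256  # id of <|endoftext|> in r50k_base, as stated above


def toy_encode(text):
    """Stand-in tokenizer: one id per UTF-8 byte. This is not r50k_base."""
    return list(text.encode("utf-8"))


def flatten_documents(docs, encode=toy_encode, eot_id=EOT_ID):
    """Encode each document and append eot_id, producing one flat id list."""
    flat = []
    for doc in docs:
        flat.extend(encode(doc))
        flat.append(eot_id)  # marks where this document stops
    return flat


print(flatten_documents(["Hi", "ok"]))  # [72, 105, 50256, 111, 107, 50256]
```

With tiktoken installed, passing `tiktoken.get_encoding("r50k_base").encode_ordinary` as `encode` should give real token ids in place of the byte values.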

For the legacy path, download a slice of The Pile and tokenize it into HDF5:

python scripts/data_download.py            # downloads the validation file + 1 training shard
python scripts/data_preprocess.py          # tokenizes to data/train/pile_train.h5 and data/val/pile_dev.h5

The newer, faster path streams and batch-encodes the same data straight into a flat token array:

python scripts/prepare_pretrain_data.py --split val   --out data/pile_dev.h5
python scripts/prepare_pretrain_data.py --split train --num_shards 1 --out data/pile_train.h5

Once tokenized, the data is just a long line of integers. Here is a real peek at the validation file I prepared for this README (8.76 million tokens): its dtype and shape, the first ten ids, and what they decode back to:

#### OUTPUT ####
dtype: int32 | shape: (8762951,) | total tokens: 8762951
first 10 token ids: [18610, 286, 3993, 3081, 319, 4088, 11, 4640, 2163, 11]
decoded back to text:
'Effect of sleep quality on memory, executive function, and language
 performance in patients with refractory focal epilepsy ...'

That is the whole idea of tokenization in one output: text in, a flat array of integers out, and the integers decode straight back to the original words.

The chat format and loss mask

For everything after pretraining the model has to know who is talking. The r50k_base tokenizer has only one special token, so instead of inventing new ones we use plain text role markers that the model simply learns during SFT. A single turn looks like this (see src/post_training/chat_template.py):

<|user|>
{user content}<|endoftext|><|assistant|>
{assistant content}<|endoftext|>

For math and reasoning we ask the assistant to show its work in a fixed structure, because the reinforcement learning reward later checks the number inside the answer tags:

<think>step by step reasoning ...</think><answer>42</answer>

The important trick is the loss mask. When we encode a conversation we also build a 0/1 mask that is 1 only on the assistant tokens (and the <|endoftext|> that ends the turn). That way SFT trains the model to write answers, not to parrot the prompt back. Here is the exact code that builds the ids and the aligned mask:

def encode_chat(messages, add_generation_prompt=False):
    ids, mask = [], []
    for m in messages:
        role = m["role"]
        # Role header is always masked out (we never train the model to emit it).
        header_ids = _encode_ordinary(_header_for(role))
        ids.extend(header_ids)
        mask.extend([0] * len(header_ids))

        content_ids = _encode_ordinary(m["content"])
        is_completion = role == "assistant"
        ids.extend(content_ids)
        mask.extend([1 if is_completion else 0] * len(content_ids))   # train on assistant only

        ids.append(EOT_ID)                                            # turn terminator
        mask.append(1 if is_completion else 0)                        # learn to stop
    return ids, mask
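
To show what the mask is for, here is a minimal sketch of how it could enter the loss. It assumes the usual next-token setup, where position t predicts token t+1, so the targets and their mask are shifted by one. The helper names and all numbers are made up for illustration and are not the repo's own functions.

```python
def shift_for_next_token(ids, mask):
    """Position t predicts token t+1, so targets and their mask drop the first slot."""
    return ids[:-1], ids[1:], mask[1:]


def masked_average(per_token_loss, mask):
    """Average the loss over positions where mask == 1 only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing to train on in this row
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


ids = [10, 11, 12, 20, 21, 50256]  # made-up ids: 3 prompt tokens, then the answer
mask = [0, 0, 0, 1, 1, 1]
inputs, targets, target_mask = shift_for_next_token(ids, mask)

per_token_loss = [2.0, 2.0, 0.5, 1.5, 1.0]  # made-up losses, one per target
print("masked:", masked_average(per_token_loss, target_mask))
print("unmasked:", sum(per_token_loss) / len(per_token_loss))
```

The masked average only counts the answer tokens and the closing `<|endoftext|>`, so the high losses on the prompt positions do not pull the gradient toward reproducing the prompt.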

Here is a real rendered conversation and the verifier reward in action, printed from this repo:

#### OUTPUT ####
rendered chat:
<|user|>
What is 13 + 29?<|endoftext|><|assistant|>
<think>13 + 29 = 42</think><answer>42</answer><|endoftext|>

extract_answer("<answer>42</answer>")        -> 42.0
reward_gsm8k("<answer>42</answer>", 42.0)    -> 1.2    # correct AND well formatted
reward_gsm8k("<answer>7</answer>",  42.0)    -> 0.2    # wrong, but it used the format
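
The three printed values are consistent with a small format bonus plus a larger correctness bonus. Here is a minimal sketch under that assumption. The function names, the 0.2 and 1.0 split, the zero reward for a missing answer, and the choice to read the last answer tag are all inferred or invented for illustration; the repo's `extract_answer` and `reward_gsm8k` may behave differently.

```python
import re

# A number, optionally negative, with optional thousands commas and decimals.
_ANSWER_RE = re.compile(r"<answer>\s*(-?\d[\d,]*(?:\.\d+)?)\s*</answer>")


def extract_answer_sketch(text):
    """Return the number inside the last <answer>...</answer> pair, or None."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reward_sketch(text, gold, format_bonus=0.2, correct_bonus=1.0, tol=1e-6):
    """Format bonus for a parsable answer, plus a bonus if it matches gold."""
    value = extract_answer_sketch(text)
    if value is None:
        return 0.0  # assumed: no parsable answer means no reward
    reward = format_bonus
    if abs(value - gold) <= tol:
        reward += correct_bonus
    return reward


print(extract_answer_sketch("<think>13 + 29 = 42</think><answer>42</answer>"))
print(reward_sketch("<answer>42</answer>", 42.0))
print(reward_sketch("<answer>7</answer>", 42.0))
print(reward_sketch("the answer is 42", 42.0))
```

Because the reward only needs a string and a gold number, it can be checked without a learned reward model, which is what makes this kind of verifier convenient for the RL stages.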

And here is one real packed SFT row, showing how only the assistant tokens are trained (the mask is 1 on 48 of the 512 tokens in this row):

#### OUTPUT ####
tokens shape: (2131, 512) | loss_mask shape: (2131, 512)

Core symbols most depended-on inside this repo

Symbol	Called by	File
check	25	tests/verify_data_and_eval.py
unwrap	19	src/post_training/utils.py
compute_logprobs	15	src/post_training/rollout.py
amp_autocast	14	src/post_training/utils.py
log	13	src/post_training/logging_utils.py
save_stage_ckpt	12	src/post_training/utils.py
reduce_scalar	12	src/post_training/distributed.py
masked_mean	11	src/post_training/utils.py

Shape

Kind	Count
Function	222
Method	30
Class	18

Languages

Python 100%

Modules by API surface

Module	Symbols
scripts/train_transformer.py	22
ui/jobs.py	13
src/post_training/utils.py	13
tests/test_post_training_smoke.py	10
src/post_training/distributed.py	10
src/post_training/rollout.py	9
config/post_training_config.py	8
src/post_training/chat_template.py	7
src/models/transformer.py	7
scripts/prepare_sft_data.py	7
tests/verify_data_and_eval.py	6
tests/test_rl_math.py	6

Dependencies from manifests, versioned

Package	Count
h5py	1×
numpy	1×
requests	1×
tiktoken	1×
torch	1×
tqdm	1×
zstandard	1×

For agents

$ claude mcp add train-llm-from-scratch \
  -- python -m otcore.mcp_server <graph>
