MCPcopy
hub / github.com/FareedKhan-dev/train-llm-from-scratch / configure_optimizer

Function configure_optimizer

src/post_training/optim.py:17–37  ·  view source on GitHub ↗

AdamW with weight decay applied only to >=2D parameters (matrices), not to biases / norms / 1D params. Standard GPT recipe.

(
    model: nn.Module,
    lr: float,
    weight_decay: float,
    betas: tuple[float, float] = (0.9, 0.95),
)

Source from the content-addressed store, hash-verified

15
16
17def configure_optimizer(
18 model: nn.Module,
19 lr: float,
20 weight_decay: float,
21 betas: tuple[float, float] = (0.9, 0.95),
22) -> torch.optim.AdamW:
23 """AdamW with weight decay applied only to >=2D parameters (matrices), not to
24 biases / norms / 1D params. Standard GPT recipe."""
25 decay, no_decay = [], []
26 for name, p in model.named_parameters():
27 if not p.requires_grad:
28 continue
29 if p.dim() >= 2:
30 decay.append(p)
31 else:
32 no_decay.append(p)
33 groups = [
34 {"params": decay, "weight_decay": weight_decay},
35 {"params": no_decay, "weight_decay": 0.0},
36 ]
37 return torch.optim.AdamW(groups, lr=lr, betas=betas)
38
39
40def cosine_lr(step: int, *, warmup_steps: int, max_steps: int, lr: float, min_lr: float) -> float:

Callers 6

mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90

Calls

no outgoing calls

Tested by

no test coverage detected