hub / github.com/FareedKhan-dev/train-llm-from-scratch / k3_kl

Function k3_kl

src/post_training/grpo.py:31–34

Per-token, non-negative estimator of KL(policy||ref) (Schulman's k3); it is unbiased when the tokens are sampled from the policy.
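A short derivation may help show why this estimator is both unbiased and non-negative. The symbols \pi (policy), \pi_{\mathrm{ref}} (reference) and r are notation introduced here for illustration and do not appear in the source. The sketch assumes that the token x is sampled from the policy and that every token with non-zero reference probability also has non-zero policy probability.

```latex
\begin{aligned}
r(x) &= \frac{\pi_{\mathrm{ref}}(x)}{\pi(x)}
      = \exp\bigl(\texttt{ref\_logp} - \texttt{new\_logp}\bigr), \\
k_3(x) &= r(x) - \log r(x) - 1, \\
\mathbb{E}_{x \sim \pi}\bigl[k_3(x)\bigr]
  &= \underbrace{\mathbb{E}_{x \sim \pi}\bigl[r(x)\bigr] - 1}_{=\,0}
   + \underbrace{\mathbb{E}_{x \sim \pi}\bigl[-\log r(x)\bigr]}_{=\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}, \\
k_3(x) &\ge 0 \quad \text{because } \log r \le r - 1 \text{ for all } r > 0,
  \text{ with equality only at } r = 1.
\end{aligned}
```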

(new_logp: torch.Tensor, ref_logp: torch.Tensor)

Source from the content-addressed store, hash-verified

def k3_kl(new_logp: torch.Tensor, ref_logp: torch.Tensor) -> torch.Tensor:
    """Per-token unbiased, non-negative KL estimator (Schulman's k3) for KL(policy||ref)."""
    diff = ref_logp - new_logp
    return torch.exp(diff) - diff - 1.0
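The behaviour of the formula above can be checked on a small example. The following is a minimal sketch, not code from the repository: `k3_scalar` is a hypothetical standard-library analogue of `k3_kl` that works on plain floats instead of tensors, and the two 4-token distributions are made-up values chosen for illustration. It computes the per-token estimate for every possible token, then compares the exact expectation under the policy with the exact KL(policy||ref).

```python
import math


def k3_scalar(new_logp: float, ref_logp: float) -> float:
    """Hypothetical scalar analogue of k3_kl: exp(d) - d - 1, d = ref_logp - new_logp."""
    diff = ref_logp - new_logp
    return math.exp(diff) - diff - 1.0


# Made-up next-token distributions over a 4-token vocabulary (illustration only).
policy = [0.5, 0.2, 0.2, 0.1]
ref = [0.4, 0.3, 0.2, 0.1]

# One estimate per token that the policy could have sampled.
per_token = [k3_scalar(math.log(p), math.log(q)) for p, q in zip(policy, ref)]

# Exact expectation of the estimator when tokens are drawn from the policy.
expected_k3 = sum(p * k for p, k in zip(policy, per_token))

# Exact KL(policy || ref) for comparison.
exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

print("per-token k3:", per_token)
print("E_policy[k3]:", expected_k3)
print("KL(policy||ref):", exact_kl)
```

Every per-token value is non-negative, including tokens where the reference assigns higher probability than the policy, and the policy-weighted average matches the exact KL up to floating-point error. Tokens on which the two distributions agree contribute exactly zero.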

Callers 2

test_grpo_loss_and_kl · Function · 0.90
grpo_loss · Function · 0.85

Calls

no outgoing calls

Tested by 1

test_grpo_loss_and_kl · Function · 0.72