File train_ppo.py

scripts/train_ppo.py

Source from the content-addressed store, hash-verified

"""
PPO RLHF on GSM8K (the classic InstructGPT recipe), from scratch.

Per iteration: roll out completions with the current policy, score them (verifiable GSM8K
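
The excerpt cuts off before it says how completions are scored, so the following is only a rough sketch of what a verifiable GSM8K reward could look like, not the code in train_ppo.py. It assumes the GSM8K convention that reference answers end with `#### <number>`, and it assumes a binary exact-match reward. The names `extract_final_number` and `gsm8k_reward` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not taken from train_ppo.py.
# Matches integers and decimals, allowing thousands separators such as "1,000".
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number after the final '####' marker.

    If no marker is present, fall back to the last number anywhere in the text.
    Returns None when no number is found.
    """
    if "####" in text:
        text = text.rsplit("####", 1)[1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_reward(completion: str, reference: str) -> float:
    """Assumed binary reward: 1.0 if the final numbers agree, else 0.0."""
    pred = extract_final_number(completion)
    gold = extract_final_number(reference)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if float(pred) == float(gold) else 0.0


if __name__ == "__main__":
    reference = "Natalia sold 48 / 2 = 24 clips in May.\n#### 72"
    print(gsm8k_reward("48 + 24 = 72 clips in total.\n#### 72", reference))
    print(gsm8k_reward("I think the answer is 70.", reference))
```

A reward of this kind needs no learned reward model, which is presumably what "verifiable" refers to in the docstring. The actual script may parse or weight answers differently.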

nothing calls this directly

mainFunction · 0.70

no test coverage detected