MCPcopy
hub / github.com/policy-gradient/GRPO-Zero / reward_function

Function reward_function

countdown_task.py:148–166  ·  view source on GitHub ↗

Reward function for Countdown Tasks. Total reward = 0.1 * format_reward + answer_reward

(
    response: str,
    numbers: List[int] = None,
    target: int = None,
    end_token: str = None,
)

Source from the content-addressed store, hash-verified

146
147
148def reward_function(
149 response: str,
150 numbers: List[int] = None,
151 target: int = None,
152 end_token: str = None,
153) -> Dict[str, Any]:
154 """Reward function for Countdown Tasks.
155
156 Total reward = 0.1 * format_reward + answer_reward
157 """
158 format_reward = format_reward_function("<think>" + response, end_token)
159 answer_reward = answer_reward_function(response, numbers, target)
160 return {
161 "reward": format_reward * 0.1 + answer_reward,
162 "reward_info": {
163 "format_reward": format_reward,
164 "answer_reward": answer_reward,
165 },
166 }

Callers 1

rolloutFunction · 0.85

Calls 2

format_reward_functionFunction · 0.85
answer_reward_functionFunction · 0.85

Tested by

no test coverage detected