class MonteCarloAgent(AgentBase):
    def __init__(self, env, off_policy=False, temporal_discount=0.9, epsilon=0.1):
        """
        A Monte-Carlo learning agent trained using either first-visit Monte
        Carlo updates (on-policy) or incremental weighted importance sampling
        (off-policy).

        Parameters
        ----------
        env : :class:`gym.wrappers` or :class:`gym.envs` instance
            The environment to run the agent on.
        off_policy : bool
            Whether to use a behavior policy separate from the target policy
            during training. If False, use the same epsilon-soft policy for
            both behavior and target policies. Default is False.
        temporal_discount : float between [0, 1]
            The discount factor used for downweighting future rewards. Smaller
            values result in greater discounting of future rewards. Default is
            0.9.
        epsilon : float between [0, 1]
            The epsilon value in the epsilon-soft policy. Larger values
            encourage greater exploration during training. Default is 0.1.
        """
        super().__init__(env)

        self.epsilon = epsilon
        self.off_policy = off_policy
        self.temporal_discount = temporal_discount

        self._init_params()

    def _init_params(self):
        E = self.env_info
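
The docstring above refers to an epsilon-soft policy and to discounting future rewards with `temporal_discount`. The sketch below illustrates both ideas in isolation. It is not taken from this class: `epsilon_soft_probs` and `discounted_returns` are hypothetical helper names, and the tabular layout of the Q-values is an assumption.

```python
import numpy as np


def epsilon_soft_probs(q_values, epsilon=0.1):
    """Action probabilities of an epsilon-soft policy for a single state.

    Hypothetical helper for illustration; not part of MonteCarloAgent.
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.size
    # Every action receives an equal share of the exploration mass epsilon.
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action additionally receives the remaining 1 - epsilon.
    # np.argmax breaks ties in favor of the lowest index.
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs


def discounted_returns(rewards, temporal_discount=0.9):
    """Return G_t = r_t + temporal_discount * G_{t+1} for each step t.

    Hypothetical helper for illustration; `rewards` holds the rewards of
    one finished episode in the order they were received.
    """
    returns = np.zeros(len(rewards))
    g = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + temporal_discount * g
        returns[t] = g
    return returns
```

With three rewards of 1.0 and a discount of 0.9, the returns are 2.71, 1.9, and 1.0: a smaller `temporal_discount` shrinks the contribution of later rewards, and a larger `epsilon` moves probability mass away from the greedy action.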
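
For the off-policy case, the docstring names incremental weighted importance sampling. A minimal sketch of one such update over a single episode follows, in the style of the standard textbook formulation with a greedy target policy. The function name `weighted_is_update`, the tabular `Q` and `C` arrays, and the `behavior_probs` table are all assumptions made for illustration; the agent's actual update may differ.

```python
import numpy as np


def weighted_is_update(Q, C, episode, behavior_probs, temporal_discount=0.9):
    """Apply one incremental weighted importance sampling update in place.

    Hypothetical sketch, not the agent's own method.

    Q              : (n_states, n_actions) array of action-value estimates
    C              : (n_states, n_actions) array of cumulative weights
    episode        : list of (state, action, reward) tuples in time order
    behavior_probs : (n_states, n_actions) array with b(action | state)
    """
    G = 0.0  # discounted return accumulated from the end of the episode
    W = 1.0  # importance sampling ratio accumulated from the end
    for state, action, reward in reversed(episode):
        G = temporal_discount * G + reward
        C[state, action] += W
        # Weighted average: step size is the share of W in the total weight.
        Q[state, action] += (W / C[state, action]) * (G - Q[state, action])
        # The target policy is assumed greedy with respect to Q. If the
        # behavior action is not greedy, the ratio for all earlier steps
        # is zero, so stop.
        if action != np.argmax(Q[state]):
            break
        # Greedy target gives pi(action | state) = 1, so the ratio
        # grows by 1 / b(action | state).
        W /= behavior_probs[state, action]
    return Q, C


# Usage: two states, two actions, uniform behavior policy.
Q = np.zeros((2, 2))
C = np.zeros((2, 2))
b = np.full((2, 2), 0.5)
weighted_is_update(Q, C, [(0, 0, 1.0), (1, 1, 2.0)], b)
```

Processing the episode backwards lets the return and the importance ratio be updated in constant time per step, which is what makes the update incremental.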