hub / github.com/ddbourgin/numpy-ml / __init__

Method __init__

numpy_ml/rl_models/agents.py:1305–1426

(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        q_plus=False,
        grid_dims=[8, 8],
        explore_weight=0.05,
        temporal_discount=0.9,
        n_simulated_actions=50,
    )

Source from the content-addressed store, hash-verified

class DynaAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        q_plus=False,
        grid_dims=[8, 8],
        explore_weight=0.05,
        temporal_discount=0.9,
        n_simulated_actions=50,
    ):
        r"""
        A Dyna-`Q` / Dyna-`Q+` agent [5]_ with full TD(0) `Q`-learning updates via
        prioritized-sweeping [6]_ .

        Notes
        -----
        This approach consists of three components: a planning method involving
        simulated actions, a direct RL method where the agent directly interacts
        with the environment, and a model-learning method where the agent
        learns to better represent the environment during planning.

        During planning, the agent performs random-sample one-step tabular
        Q-planning with prioritized sweeping. This entails using a priority
        queue to retrieve the state-action pairs from the agent's history which
        would stand to have the largest change to their Q-values if backed up.
        Specifically, for state action pair `(s, a)` the priority value is:

        .. math::

            P = \sum_{s'} p(s') | r + \gamma \max_a \{Q(s', a) \} - Q(s, a) |

        which corresponds to the absolute magnitude of the TD(0) Q-learning
        backup for the pair.

        When the first pair in the queue is backed up, the effect on each of
        its predecessor pairs is computed. If the predecessor's priority is
        greater than a small threshold the pair is added to the queue and the
        process is repeated until either the queue is empty or we have exceeded
        `n_simulated_actions` updates. These backups occur without the agent
        taking any action in the environment and thus constitute simulations
        based on the agent's current model of the environment (i.e., its
        tabular state-action history).

        During the direct RL phase, the agent takes an action based on its
        current behavior policy and Q function and receives a reward from the
        environment. The agent logs this state-action-reward-new state tuple in
        its interaction table (i.e., environment model) and updates its Q
        function using a full-backup version of the Q-learning update:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \sum_{r, s'} p(r, s' \mid s, a)
                \left(r + \gamma \max_a \{ Q(s', a) \} - Q(s, a) \right)

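The planning and direct-RL steps described in the docstring can be illustrated with a minimal, self-contained sketch. This is not the `DynaAgent` implementation: the helper names (`td_terms`, `priority`, `full_backup`, `prioritized_sweep`), the dictionary layout of `model` and `predecessors`, and the `threshold` default are all assumptions made for illustration. Only the defaults `lr=0.4`, `gamma=0.9`, and `n_simulated_actions=50` mirror the signature above. The sketch keys the model on `(reward, next_state)` outcomes, so the priority is the expected absolute TD(0) error over those outcomes.

```python
import heapq

# Hypothetical layout (not the library's): Q[s][a] is a nested list,
# model[(s, a)] maps (reward, next_state) -> empirical probability,
# predecessors[s] is the set of (s_prev, a_prev) pairs observed to lead to s.

def td_terms(Q, model, s, a, gamma):
    """Yield (probability, TD(0) error) for each modeled outcome of (s, a)."""
    for (r, s_next), p in model[(s, a)].items():
        yield p, r + gamma * max(Q[s_next]) - Q[s][a]

def priority(Q, model, s, a, gamma):
    """Expected absolute TD(0) error, i.e. the priority value P."""
    return sum(p * abs(td) for p, td in td_terms(Q, model, s, a, gamma))

def full_backup(Q, model, s, a, lr, gamma):
    """Full-backup Q-learning update: shift Q(s, a) by lr * expected TD error."""
    Q[s][a] += lr * sum(p * td for p, td in td_terms(Q, model, s, a, gamma))

def prioritized_sweep(Q, model, predecessors, start, lr=0.4, gamma=0.9,
                      n_simulated_actions=50, threshold=1e-4):
    """Back up pairs in priority order; return the number of backups done."""
    queue = []  # min-heap, so priorities are stored negated
    p0 = priority(Q, model, *start, gamma)
    if p0 > threshold:
        heapq.heappush(queue, (-p0, start))
    n_updates = 0
    while queue and n_updates < n_simulated_actions:
        _, (s, a) = heapq.heappop(queue)
        full_backup(Q, model, s, a, lr, gamma)
        n_updates += 1
        # Re-score every pair known to lead into s and queue the large ones.
        for s_prev, a_prev in predecessors.get(s, ()):
            pr = priority(Q, model, s_prev, a_prev, gamma)
            if pr > threshold:
                heapq.heappush(queue, (-pr, (s_prev, a_prev)))
    return n_updates

# Toy deterministic chain 0 -> 1 -> 2 with a reward of 1 on the last step.
Q = [[0.0], [0.0], [0.0]]
model = {(0, 0): {(0.0, 1): 1.0}, (1, 0): {(1.0, 2): 1.0}}
predecessors = {1: {(0, 0)}, 2: {(1, 0)}}
n_updates = prioritized_sweep(Q, model, predecessors, start=(1, 0))
```

In this toy chain the pair `(1, 0)` is backed up first, after which its predecessor `(0, 0)` exceeds the threshold and is backed up in turn, so value propagates backward without any environment step. The sketch pushes a fresh queue entry whenever a predecessor is re-scored rather than updating an existing one, which keeps the code short at the cost of occasional stale entries.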

Callers

nothing calls this directly

Calls 2

_init_params · Method · 0.95
__init__ · Method · 0.45

Tested by

no test coverage detected