r""" A Dyna-`Q` / Dyna-`Q+` agent [5]_ with full TD(0) `Q`-learning updates via prioritized-sweeping [6]_ . Notes ----- This approach consists of three components: a planning method involving simulated actions, a direct RL method where the agent direc
(
self,
env,
lr=0.4,
epsilon=0.1,
n_tilings=8,
obs_max=None,
obs_min=None,
q_plus=False,
grid_dims=[8, 8],
explore_weight=0.05,
temporal_discount=0.9,
n_simulated_actions=50,
)
class DynaAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        q_plus=False,
        grid_dims=[8, 8],
        explore_weight=0.05,
        temporal_discount=0.9,
        n_simulated_actions=50,
    ):
        r"""
        A Dyna-`Q` / Dyna-`Q+` agent [5]_ with full TD(0) `Q`-learning updates via
        prioritized sweeping [6]_.

        Notes
        -----
        This approach consists of three components: a planning method involving
        simulated actions, a direct RL method where the agent directly interacts
        with the environment, and a model-learning method where the agent
        learns to better represent the environment during planning.

        During planning, the agent performs random-sample one-step tabular
        Q-planning with prioritized sweeping. This entails using a priority
        queue to retrieve the state-action pairs from the agent's history
        whose Q-values would change the most if backed up. Specifically, for
        a state-action pair `(s, a)` the priority value is:

        .. math::

            P = \sum_{s'} p(s' \mid s, a)
                \left| r + \gamma \max_{a'} \{ Q(s', a') \} - Q(s, a) \right|

        which corresponds to the expected absolute magnitude of the TD(0)
        Q-learning backup for the pair.

        When the first pair in the queue is backed up, the effect on each of
        its predecessor pairs is computed. If a predecessor's priority is
        greater than a small threshold, the pair is added to the queue, and
        the process repeats until either the queue is empty or we have
        exceeded `n_simulated_actions` updates. These backups occur without
        the agent taking any action in the environment and thus constitute
        simulations based on the agent's current model of the environment
        (i.e., its tabular state-action history).

        During the direct RL phase, the agent takes an action based on its
        current behavior policy and Q function and receives a reward from the
        environment. The agent logs this (state, action, reward, new state)
        tuple in its interaction table (i.e., environment model) and updates
        its Q function using a full-backup version of the Q-learning update:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \sum_{r, s'} p(r, s' \mid s, a)
                \left(r + \gamma \max_{a'} \{ Q(s', a') \} - Q(s, a) \right)
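
The full-backup update in the docstring above can be illustrated with a small standalone sketch. It is not the `DynaAgent` implementation: the function name `full_backup_update`, the dictionary-based `Q` table, and the `model` layout (outcome counts keyed by `(reward, next_state)`) are all assumptions made for illustration, with `p(r, s' | s, a)` estimated from those counts.

```python
from collections import defaultdict


def full_backup_update(Q, model, s, a, n_actions, lr=0.4, gamma=0.9):
    """Apply one full (expected) TD(0) Q-learning backup to Q[(s, a)].

    Assumed (hypothetical) data layout:
      Q     -- defaultdict(float) keyed by (state, action)
      model -- model[(s, a)][(reward, next_state)] = number of times observed
    Returns the expected TD error that was applied (before scaling by lr).
    """
    outcomes = model[(s, a)]
    total = sum(outcomes.values())
    if total == 0:
        return 0.0  # (s, a) was never observed, so there is nothing to back up

    expected_td = 0.0
    for (r, s_next), count in outcomes.items():
        p = count / total  # empirical estimate of p(r, s' | s, a)
        max_q = max(Q[(s_next, a_next)] for a_next in range(n_actions))
        expected_td += p * (r + gamma * max_q - Q[(s, a)])

    Q[(s, a)] += lr * expected_td
    return expected_td


# Usage: action 1 in state 0 was tried four times with two distinct outcomes.
Q = defaultdict(float)
model = defaultdict(lambda: defaultdict(int))
model[(0, 1)][(1.0, 2)] += 3  # reward 1.0, next state 2, seen three times
model[(0, 1)][(0.0, 0)] += 1  # reward 0.0, next state 0, seen once
td = full_backup_update(Q, model, s=0, a=1, n_actions=2)
```

With an all-zero `Q` table, the expected TD error here is `0.75 * 1.0 + 0.25 * 0.0 = 0.75`, so `Q[(0, 1)]` moves to roughly `0.4 * 0.75 = 0.3`. Averaging over every recorded outcome, rather than using only the most recent sample, is what distinguishes the full backup from the usual sample-based Q-learning update.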
nothing calls this directly
no test coverage detected
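
The planning phase described in the `DynaAgent` docstring (priority queue, backup of the top pair, re-queuing of predecessors above a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's actual code: the helper names (`td_errors`, `priority`, `plan`), the `model` layout, the `start_pairs` argument, and the threshold value `1e-3` are all hypothetical. The defaults for `lr`, `gamma`, and `n_simulated_actions` simply mirror the constructor signature.

```python
import heapq
from collections import defaultdict


def td_errors(Q, model, s, a, n_actions, gamma):
    """Return (probability, TD(0) error) for each modelled outcome of (s, a).

    Assumed layout: model[(s, a)][(reward, next_state)] = observation count.
    """
    counts = model.get((s, a), {})
    total = sum(counts.values())
    out = []
    for (r, s_next), n in counts.items():
        max_q = max(Q[(s_next, b)] for b in range(n_actions))
        out.append((n / total, r + gamma * max_q - Q[(s, a)]))
    return out


def priority(Q, model, s, a, n_actions, gamma):
    """Expected absolute TD(0) error for (s, a) under the empirical model."""
    return sum(p * abs(e) for p, e in td_errors(Q, model, s, a, n_actions, gamma))


def plan(Q, model, start_pairs, n_actions, lr=0.4, gamma=0.9,
         n_simulated_actions=50, threshold=1e-3):
    """Run prioritized-sweeping backups; return the number of updates made."""
    # predecessors[s'] holds every (s, a) pair the model has seen lead to s'
    predecessors = defaultdict(set)
    for (s, a), counts in model.items():
        for _, s_next in counts:
            predecessors[s_next].add((s, a))

    # heapq is a min-heap, so priorities are negated to pop the largest first.
    # A pair may be queued more than once; this sketch does not deduplicate.
    queue = []
    for s, a in start_pairs:
        p = priority(Q, model, s, a, n_actions, gamma)
        if p > threshold:
            heapq.heappush(queue, (-p, (s, a)))

    n_updates = 0
    while queue and n_updates < n_simulated_actions:
        _, (s, a) = heapq.heappop(queue)
        # Full backup: average the TD error over all modelled outcomes
        expected = sum(p * e for p, e in td_errors(Q, model, s, a, n_actions, gamma))
        Q[(s, a)] += lr * expected
        n_updates += 1
        # Changing Q[(s, a)] may change the TD error of pairs that lead to s
        for s_prev, a_prev in predecessors[s]:
            p = priority(Q, model, s_prev, a_prev, n_actions, gamma)
            if p > threshold:
                heapq.heappush(queue, (-p, (s_prev, a_prev)))
    return n_updates


# Usage: a three-state chain 0 -> 1 -> 2 with a single action and one reward.
Q = defaultdict(float)
model = {(0, 0): {(0.0, 1): 1}, (1, 0): {(1.0, 2): 1}}
n = plan(Q, model, start_pairs=[(1, 0)], n_actions=1)
```

In this chain, backing up `(1, 0)` first raises its value, which in turn gives its predecessor `(0, 0)` a nonzero priority, so the reward information propagates backwards in two updates without any interaction with an environment.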