hub / github.com/ddbourgin/numpy-ml / __init__

Method __init__

numpy_ml/rl_models/agents.py:1305–1426

A Dyna-`Q` / Dyna-`Q+` agent [5]_ with full TD(0) `Q`-learning updates via prioritized-sweeping [6]_ .

(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        q_plus=False,
        grid_dims=[8, 8],
        explore_weight=0.05,
        temporal_discount=0.9,
        n_simulated_actions=50,
    )

Source from the content-addressed store, hash-verified

class DynaAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        q_plus=False,
        grid_dims=[8, 8],
        explore_weight=0.05,
        temporal_discount=0.9,
        n_simulated_actions=50,
    ):
        r"""
        A Dyna-`Q` / Dyna-`Q+` agent [5]_ with full TD(0) `Q`-learning updates via
        prioritized-sweeping [6]_ .

        Notes
        -----
        This approach consists of three components: a planning method involving
        simulated actions, a direct RL method where the agent directly interacts
        with the environment, and a model-learning method where the agent
        learns to better represent the environment during planning.

        During planning, the agent performs random-sample one-step tabular
        Q-planning with prioritized sweeping. This entails using a priority
        queue to retrieve the state-action pairs from the agent's history which
        would stand to have the largest change to their Q-values if backed up.
        Specifically, for state action pair `(s, a)` the priority value is:

        .. math::

            P = \sum_{s'} p(s') \| r + \gamma \max_a \{Q(s', a) \} - Q(s, a) \|

        which corresponds to the absolute magnitude of the TD(0) Q-learning
        backup for the pair.

        When the first pair in the queue is backed up, the effect on each of
        its predecessor pairs is computed. If the predecessor's priority is
        greater than a small threshold the pair is added to the queue and the
        process is repeated until either the queue is empty or we have exceeded
        `n_simulated_actions` updates. These backups occur without the agent
        taking any action in the environment and thus constitute simulations
        based on the agent's current model of the environment (i.e., its
        tabular state-action history).

        During the direct RL phase, the agent takes an action based on its
        current behavior policy and Q function and receives a reward from the
        environment. The agent logs this state-action-reward-new state tuple in
        its interaction table (i.e., environment model) and updates its Q
        function using a full-backup version of the Q-learning update:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \sum_{r, s'} p(r, s' \mid s, a)
                \left(r + \gamma \max_a \{ Q(s', a) \} - Q(s, a) \right)
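
The priority formula, the sweeping loop, and the full-backup update described in the docstring can be sketched on a tiny tabular problem. This is a minimal illustration under stated assumptions, not the numpy-ml implementation: the `model` dictionary of outcome counts, the helper names (`outcome_probs`, `td_errors`, `priority`, `full_backup`, `plan`), and the `threshold` value are all hypothetical. The sketch sums over `(r, s')` outcomes in both formulas, which is one reading of the `p(s')` term in the priority equation.

```python
import heapq

GAMMA = 0.9  # mirrors the temporal_discount default in the signature
LR = 0.4     # mirrors the lr default in the signature

# Hypothetical interaction table: (s, a) -> {(reward, next_state): visit count}
model = {
    (0, 0): {(0.0, 1): 3, (1.0, 2): 1},
    (1, 0): {(1.0, 2): 2},
    (2, 0): {(0.0, 2): 1},
}


def outcome_probs(model, s, a):
    """Empirical p(r, s' | s, a) estimated from visit counts."""
    counts = model[(s, a)]
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


def td_errors(Q, model, s, a, gamma):
    """Yield (probability, TD(0) error) for each recorded outcome of (s, a)."""
    for (r, s_next), p in outcome_probs(model, s, a).items():
        yield p, r + gamma * max(Q[s_next]) - Q[s][a]


def priority(Q, model, s, a, gamma=GAMMA):
    """Expected absolute TD(0) error for the pair (s, a)."""
    return sum(p * abs(err) for p, err in td_errors(Q, model, s, a, gamma))


def full_backup(Q, model, s, a, lr=LR, gamma=GAMMA):
    """Q(s, a) <- Q(s, a) + lr * expected TD(0) error."""
    Q[s][a] += lr * sum(p * err for p, err in td_errors(Q, model, s, a, gamma))


def plan(Q, model, n_updates=50, threshold=1e-3):
    """Prioritized sweeping: back up the highest-priority pair, then
    re-score the pairs that lead into the state just updated."""
    queue = []  # heapq is a min-heap, so priorities are stored negated
    for (s, a) in model:
        p = priority(Q, model, s, a)
        if p > threshold:
            heapq.heappush(queue, (-p, s, a))

    n_done = 0
    while queue and n_done < n_updates:
        _, s, a = heapq.heappop(queue)
        full_backup(Q, model, s, a)
        n_done += 1
        # Predecessors: any recorded pair with an outcome landing in s.
        # Stale duplicates may remain in the queue; the update cap bounds them.
        for (ps, pa), counts in model.items():
            if any(s_next == s for (_, s_next) in counts):
                p = priority(Q, model, ps, pa)
                if p > threshold:
                    heapq.heappush(queue, (-p, ps, pa))
    return n_done
```

Calling `plan(Q, model)` with `Q = {s: [0.0] for s in range(3)}` performs simulated backups only, with no environment step, and stops once the queue is empty or `n_updates` backups have been made.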

Callers

nothing calls this directly

Calls 2

_init_params · Method · 0.95

__init__ · Method · 0.45

Tested by

no test coverage detected
