hub / github.com/ddbourgin/numpy-ml / __init__

Method __init__

numpy_ml/rl_models/agents.py:794–899

A temporal difference learning agent with expected SARSA (on-policy) [3]_ or TD(0) `Q`-learning (off-policy) [4]_ updates.

(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        grid_dims=[8, 8],
        off_policy=False,
        temporal_discount=0.99,
    )
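
The `n_tilings`, `grid_dims`, `obs_min`, and `obs_max` arguments relate to discretizing a continuous observation space by tiling. The block below is a minimal sketch of the general tile-coding idea, not the library's code: the helper name `tile_indices` and the evenly spaced per-tiling offset are assumptions made for illustration.

```python
import numpy as np


def tile_indices(obs, obs_min, obs_max, grid_dims=(8, 8), n_tilings=8):
    """Return one (tiling, cell...) index tuple per tiling for a continuous obs.

    Hypothetical helper: the offset scheme below is an assumption, not
    necessarily what numpy-ml does internally.
    """
    obs = np.asarray(obs, dtype=float)
    lo = np.asarray(obs_min, dtype=float)
    hi = np.asarray(obs_max, dtype=float)
    dims = np.asarray(grid_dims)

    cell_width = (hi - lo) / dims  # width of one grid cell along each axis
    indices = []
    for t in range(n_tilings):
        # Shift each tiling by a different fraction of a cell width.
        offset = cell_width * t / n_tilings
        cell = np.floor((obs - lo + offset) / cell_width).astype(int)
        # Clip so that shifted points near the upper edge stay inside the grid.
        cell = np.clip(cell, 0, dims - 1)
        indices.append((t, *cell.tolist()))
    return indices


# One observation in [-1, 1] x [-1, 1] maps to one active cell per tiling.
active = tile_indices([0.2, -1.0], obs_min=[-1.0, -1.0], obs_max=[1.0, 1.0])
print(len(active), active[0], active[-1])
```

Each observation activates exactly one cell in every tiling, so nearby observations share some cells but not all of them. That overlap is what lets a tabular agent generalize across a continuous space.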

Source from the content-addressed store, hash-verified

class TemporalDifferenceAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        grid_dims=[8, 8],
        off_policy=False,
        temporal_discount=0.99,
    ):
        r"""
        A temporal difference learning agent with expected SARSA (on-policy) [3]_ or
        TD(0) `Q`-learning (off-policy) [4]_ updates.

        Notes
        -----
        The expected SARSA on-policy TD(0) update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \mathbb{E}_\pi[Q(s', a') \mid s'] - Q(s, a)
            \right)

        and the TD(0) off-policy Q-learning update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \max_{a'} Q(s', a') - Q(s, a)
            \right)

        where in each case we have taken action `a` in state `s`, received
        reward `r`, and transitioned into state :math:`s'`. In the above
        equations, :math:`\eta` is a learning rate parameter, :math:`\gamma` is
        a temporal discount factor, and :math:`\mathbb{E}_\pi[Q(s', a') \mid
        s']` is the expected value of the Q function under the current policy
        :math:`\pi`, given that we are in state :math:`s'`.

        Observe that the expected SARSA update can be used for both on- and
        off-policy methods. In an off-policy context, if the target policy is
        greedy and the expectation is taken with respect to the target policy,
        then the expected SARSA update is exactly Q-learning.

        NB. For this implementation the agent requires a discrete action
        space, but will try to discretize the observation space via tiling if
        it is continuous.

        References
        ----------
        .. [3] Rummery, G. & Niranjan, M. (1994). *On-Line Q-learning Using
           Connectionist Systems*. Tech Report 166. Cambridge University
           Department of Engineering.
        .. [4] Watkins, C. (1989). Learning from delayed rewards. *PhD thesis,
           Cambridge University*.
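
The two update rules in the docstring can be illustrated on a small tabular Q function. This is a sketch under stated assumptions, not the library's implementation: the function names `epsilon_greedy_probs` and `td_update` are hypothetical, and the expectation is taken under an epsilon-greedy policy, which is an assumption suggested by the `epsilon` argument rather than something the listing confirms.

```python
import numpy as np


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs


def td_update(Q, s, a, r, s_next, lr=0.4, gamma=0.99, epsilon=0.1,
              off_policy=False):
    """Apply one TD(0) update to the table Q in place; return the TD error."""
    if off_policy:
        # Q-learning: bootstrap from the greedy action value in s'.
        next_value = np.max(Q[s_next])
    else:
        # Expected SARSA: bootstrap from E_pi[Q(s', a') | s'].
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        next_value = probs @ Q[s_next]
    td_error = r + gamma * next_value - Q[s, a]
    Q[s, a] += lr * td_error
    return td_error


# Two states, two actions; state 1 already has some value estimates.
Q_on = np.zeros((2, 2))
Q_on[1] = [1.0, 3.0]
Q_off = Q_on.copy()

td_update(Q_on, s=0, a=0, r=1.0, s_next=1)                    # expected SARSA
td_update(Q_off, s=0, a=0, r=1.0, s_next=1, off_policy=True)  # Q-learning
print(Q_on[0, 0], Q_off[0, 0])
```

With `epsilon=0` the epsilon-greedy policy becomes greedy, and the expected SARSA branch produces the same target as the Q-learning branch. This matches the docstring's remark that the two updates coincide when the target policy is greedy.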

Callers

nothing calls this directly

Calls 2

_init_params · Method · 0.95

__init__ · Method · 0.45

Tested by

no test coverage detected
