hub / github.com/ddbourgin/numpy-ml / __init__

Method __init__

numpy_ml/rl_models/agents.py:794–899

A temporal difference learning agent with expected SARSA (on-policy) [3]_ or TD(0) `Q`-learning (off-policy) [4]_ updates.

(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        grid_dims=[8, 8],
        off_policy=False,
        temporal_discount=0.99,
    )
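
The `n_tilings`, `grid_dims`, `obs_min`, and `obs_max` parameters in the signature above relate to the docstring's note that the agent will try to discretize a continuous observation space via tiling. The following is a minimal sketch of tile coding under stated assumptions, not numpy-ml's own implementation: the helper name `tile_indices`, the uniform offset scheme, and the flattening of tile coordinates are all hypothetical choices made for illustration.

```python
import numpy as np


def tile_indices(obs, obs_min, obs_max, grid_dims=(8, 8), n_tilings=8):
    """Map a continuous observation to one discrete tile id per tiling.

    Hypothetical helper for illustration only; not part of numpy-ml.
    """
    obs = np.asarray(obs, dtype=float)
    lo = np.asarray(obs_min, dtype=float)
    hi = np.asarray(obs_max, dtype=float)
    dims = np.asarray(grid_dims, dtype=int)

    # Width of one tile along each observation dimension.
    width = (hi - lo) / dims
    # Each tiling gets one extra tile per dimension so that the shifted
    # grids still cover the full observation range.
    shape = tuple(int(d) + 1 for d in dims)

    indices = []
    for t in range(n_tilings):
        # Assumption: tiling t is shifted by t / n_tilings of a tile width.
        offset = (t / n_tilings) * width
        coords = np.floor((obs - lo + offset) / width).astype(int)
        coords = np.clip(coords, 0, dims)
        # Flatten the per-dimension coordinates into a single tile id.
        flat = int(np.ravel_multi_index(tuple(int(c) for c in coords), shape))
        indices.append((t, flat))
    return indices


example = tile_indices([0.5, 0.25], obs_min=[0.0, 0.0], obs_max=[1.0, 1.0])
```

Each `(tiling, tile_id)` pair can then serve as a hashable key in a tabular Q function. Using a tuple default for `grid_dims` in this sketch avoids sharing one mutable list across calls.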

Source from the content-addressed store, hash-verified

class TemporalDifferenceAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        grid_dims=[8, 8],
        off_policy=False,
        temporal_discount=0.99,
    ):
        r"""
        A temporal difference learning agent with expected SARSA (on-policy) [3]_ or
        TD(0) `Q`-learning (off-policy) [4]_ updates.

        Notes
        -----
        The expected SARSA on-policy TD(0) update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \mathbb{E}_\pi[Q(s', a') \mid s'] - Q(s, a)
            \right)

        and the TD(0) off-policy Q-learning update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \max_{a'} \left\{ Q(s', a') \right\} - Q(s, a)
            \right)

        where in each case we have taken action `a` in state `s`, received
        reward `r`, and transitioned into state :math:`s'`. In the above
        equations, :math:`\eta` is a learning rate parameter, :math:`\gamma` is
        a temporal discount factor, and
        :math:`\mathbb{E}_\pi[Q(s', a') \mid s']` is the expected value of the
        Q function under the current policy :math:`\pi`, given that we are in
        state :math:`s'`.

        Observe that the expected SARSA update can be used for both on- and
        off-policy methods. In an off-policy context, if the target policy is
        greedy and the expectation is taken with respect to the target policy,
        then the expected SARSA update is exactly Q-learning.

        NB. For this implementation the agent requires a discrete action
        space, but will try to discretize the observation space via tiling if
        it is continuous.

        References
        ----------
        .. [3] Rummery, G. & Niranjan, M. (1994). *On-Line Q-learning Using
           Connectionist Systems*. Tech Report 166. Cambridge University
           Department of Engineering.
        .. [4] Watkins, C. (1989). Learning from delayed rewards. *PhD thesis,
           Cambridge University*.
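
The two updates described in the docstring above can be sketched for a tabular Q function as follows. This is an illustrative sketch, not the library's code: it assumes an epsilon-greedy policy, a NumPy array `Q` indexed by `[state, action]`, and no special handling of terminal states. The function names `epsilon_greedy_probs` and `td_update` are hypothetical.

```python
import numpy as np


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_row)
    # Every action gets epsilon / n_actions; the greedy action gets the rest.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_row))] += 1.0 - epsilon
    return probs


def td_update(Q, s, a, r, s_next, lr=0.4, gamma=0.99, epsilon=0.1,
              off_policy=False):
    """Apply one TD(0) update to Q in place and return the TD error."""
    if off_policy:
        # Q-learning: bootstrap from the largest action value in s_next.
        target = r + gamma * np.max(Q[s_next])
    else:
        # Expected SARSA: bootstrap from the policy-weighted average value.
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        target = r + gamma * np.dot(probs, Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += lr * td_error
    return td_error


# Toy problem (assumed for illustration): 3 states, 2 actions.
Q_on = np.zeros((3, 2))
Q_on[1] = [1.0, 3.0]
Q_off = Q_on.copy()

td_update(Q_on, s=0, a=0, r=1.0, s_next=1, off_policy=False)
td_update(Q_off, s=0, a=0, r=1.0, s_next=1, off_policy=True)
```

Setting `epsilon=0` makes the policy greedy, in which case the expected SARSA target coincides with the Q-learning target, matching the docstring's observation about greedy target policies.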

Callers

nothing calls this directly

Calls 2

_init_params · Method · 0.95
__init__ · Method · 0.45

Tested by

no test coverage detected