class TemporalDifferenceAgent(AgentBase):
    def __init__(
        self,
        env,
        lr=0.4,
        epsilon=0.1,
        n_tilings=8,
        obs_max=None,
        obs_min=None,
        grid_dims=[8, 8],
        off_policy=False,
        temporal_discount=0.99,
    ):
        r"""
        A temporal difference learning agent with expected SARSA (on-policy) [3]_ or
        TD(0) `Q`-learning (off-policy) [4]_ updates.

        Notes
        -----
        The expected SARSA on-policy TD(0) update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \mathbb{E}_\pi[Q(s', a') \mid s'] - Q(s, a)
            \right)

        and the TD(0) off-policy Q-learning update is:

        .. math::

            Q(s, a) \leftarrow Q(s, a) + \eta \left(
                r + \gamma \max_{a'} \left\{ Q(s', a') \right\} - Q(s, a)
            \right)

        where in each case we have taken action `a` in state `s`, received
        reward `r`, and transitioned into state :math:`s'`. In the above
        equations, :math:`\eta` is a learning rate parameter, :math:`\gamma` is
        a temporal discount factor, and :math:`\mathbb{E}_\pi[Q(s', a') \mid
        s']` is the expected value of the Q function under the current policy
        :math:`\pi`, conditioned on being in state :math:`s'`.

        Observe that the expected SARSA update can be used for both on- and
        off-policy methods. In an off-policy context, if the target policy is
        greedy and the expectation is taken with respect to the target policy,
        then the expected SARSA update is exactly Q-learning.

        NB. This implementation requires a discrete action space, but will
        try to discretize the observation space via tiling if it is
        continuous.

        References
        ----------
        .. [3] Rummery, G. & Niranjan, M. (1994). *On-Line Q-learning Using
            Connectionist Systems*. Tech Report 166. Cambridge University
            Department of Engineering.
        .. [4] Watkins, C. (1989). Learning from delayed rewards. *PhD thesis,
            Cambridge University*.
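The two update rules in the Notes section differ only in how the bootstrap target is formed from the next state's action values. The following is a minimal standalone sketch of that difference, not the agent's actual code: the helper names (`epsilon_greedy_probs`, `expected_sarsa_target`, `q_learning_target`, `td_update`) and the numbers are hypothetical, and it assumes the current policy is epsilon-greedy with ties broken in favour of the lowest action index, which may not match how the class handles ties.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over `q_values`.

    Assumption: ties go to the lowest-index maximizing action.
    """
    n_actions = len(q_values)
    best = max(range(n_actions), key=lambda a: q_values[a])
    probs = [epsilon / n_actions] * n_actions
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_target(q_next, reward, gamma, epsilon):
    """r + gamma * E_pi[Q(s', a') | s'] for an epsilon-greedy policy pi."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return reward + gamma * expected_q


def q_learning_target(q_next, reward, gamma):
    """r + gamma * max_{a'} Q(s', a')."""
    return reward + gamma * max(q_next)


def td_update(q_sa, target, lr):
    """Q(s, a) <- Q(s, a) + eta * (target - Q(s, a))."""
    return q_sa + lr * (target - q_sa)


# Hypothetical values for a single transition (s, a, r, s').
q_sa = 0.5                # current estimate of Q(s, a)
q_next = [1.0, 2.0, 0.0]  # Q(s', a') for each of three actions
reward, gamma, lr = 1.0, 0.99, 0.4

# Expected SARSA with an exploring (epsilon = 0.1) policy.
on_policy = td_update(
    q_sa, expected_sarsa_target(q_next, reward, gamma, epsilon=0.1), lr
)
# Q-learning bootstraps from the best next action.
off_policy = td_update(q_sa, q_learning_target(q_next, reward, gamma), lr)
# Expected SARSA under a greedy target policy (epsilon = 0).
greedy = td_update(
    q_sa, expected_sarsa_target(q_next, reward, gamma, epsilon=0.0), lr
)
```

With `epsilon=0.0` the policy puts all its probability on the maximizing action, so `greedy` coincides with `off_policy`. This is the equivalence between expected SARSA with a greedy target policy and Q-learning that the docstring points out. With `epsilon > 0` the expectation averages in lower-valued actions, so the on-policy target is no larger than the Q-learning target.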