| 100 | |
| 101 | class CrossEntropyAgent(AgentBase): |
| 102 | def __init__(self, env, n_samples_per_episode=500, retain_prcnt=0.2): |
| 103 | r""" |
| 104 | A cross-entropy method agent. |
| 105 | |
| 106 | Notes |
| 107 | ----- |
| 108 | The cross-entropy method [1]_ [2]_ agent only operates on ``envs`` with |
| 109 | discrete action spaces. |
| 110 | |
| 111 | On each episode the agent generates `n_samples_per_episode` samples of the parameters |
| 112 | (:math:`\theta`) for its behavior policy. The `i`'th sample at |
| 113 | timestep `t` is: |
| 114 | |
| 115 | .. math:: |
| 116 | |
| 117 | \theta_i &= \{\mathbf{W}_i^{(t)}, \mathbf{b}_i^{(t)} \} \\ |
| 118 | \theta_i &\sim \mathcal{N}(\mu^{(t)}, \Sigma^{(t)}) |
| 119 | |
| 120 | Weights (:math:`\mathbf{W}_i`) and bias (:math:`\mathbf{b}_i`) are the |
| 121 | parameters of the softmax policy: |
| 122 | |
| 123 | .. math:: |
| 124 | |
| 125 | \mathbf{z}_i &= \text{obs} \cdot \mathbf{W}_i + \mathbf{b}_i \\ |
| 126 | p(a_i^{(t + 1)}) &= \frac{e^{\mathbf{z}_i}}{\sum_j e^{z_{ij}}} \\ |
| 127 | a^{(t + 1)} &= \arg \max_j p(a_j^{(t+1)}) |
| 128 | |
| 129 | At the end of each episode, the agent keeps the highest-scoring |
| 130 | `retain_prcnt` fraction of the :math:`\theta` samples and uses them to |
| 131 | compute the mean and variance of the distribution of :math:`\theta` for |
| 132 | the next episode: |
| 133 | |
| 134 | .. math:: |
| 135 | |
| 136 | \mu^{(t+1)} &= \text{avg}(\texttt{best_thetas}^{(t)}) \\ |
| 137 | \Sigma^{(t+1)} &= \text{var}(\texttt{best_thetas}^{(t)}) |
| 138 | |
| 139 | References |
| 140 | ---------- |
| 141 | .. [1] Mannor, S., Rubinstein, R., & Gat, Y. (2003). The cross entropy |
| 142 | method for fast policy search. In *Proceedings of the 20th Annual |
| 143 | ICML, 20*. |
| 144 | .. [2] Rubinstein, R. (1997). Optimization of computer simulation |
| 145 | models with rare events. *European Journal of Operational Research, |
| 146 | 99*, 89–112. |
| 147 | |
| 148 | Parameters |
| 149 | ---------- |
| 150 | env : :mod:`gym.wrappers` or :mod:`gym.envs` instance |
| 151 | The environment to run the agent on. |
| 152 | n_samples_per_episode : int |
| 153 | The number of theta samples to evaluate on each episode. Default is 500. |
| 154 | retain_prcnt : float |
| 155 | The fraction of `n_samples_per_episode` to use when calculating |
| 156 | the parameter update at the end of the episode. Default is 0.2. |
| 157 | """ |
| 158 | super().__init__(env) |
| 159 |
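The sampling, softmax policy, and end-of-episode update described in the docstring can be sketched roughly as follows. This is a minimal illustration and not the class's actual implementation. The helper names (`softmax`, `act`, `sample_thetas`, `unpack`, `update`), the flat layout of each theta vector, and the use of a diagonal covariance built from the per-parameter variance are all assumptions made for the example.

```python
import numpy as np


def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()


def act(obs, W, b):
    # z = obs . W + b, then take the most probable action under the softmax.
    return int(np.argmax(softmax(obs @ W + b)))


def sample_thetas(mu, var, n_samples, rng):
    # Assumption: diagonal covariance, so each parameter is drawn independently
    # from N(mu_k, var_k).
    return rng.normal(loc=mu, scale=np.sqrt(var), size=(n_samples, mu.size))


def unpack(theta, obs_dim, n_actions):
    # Assumption: theta is a flat vector holding W (row-major) followed by b.
    split = obs_dim * n_actions
    return theta[:split].reshape(obs_dim, n_actions), theta[split:]


def update(thetas, scores, retain_prcnt):
    # Keep the top-scoring fraction of samples, then refit mean and variance.
    n_retain = max(1, int(retain_prcnt * len(thetas)))
    best = thetas[np.argsort(scores)[::-1][:n_retain]]
    return best.mean(axis=0), best.var(axis=0)
```

In use, one would draw `sample_thetas(mu, var, n_samples_per_episode, rng)` at the start of an episode, score each sample by running `act` on the environment's observations, and pass the samples and scores to `update` to obtain the mean and variance for the next episode.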