Commit 61b69bcc by haoyifan
parents 4f5ef212 3c299656
@@ -122,9 +122,9 @@
 use the predicted result $\hat{t}$ of the listener agent as the
 evidence of whether to give positive rewards. Then, the gradients of the
 expected reward $J(\theta_S, \theta_L)$ can be calculated as follows:
 \begin{align}
-\nabla_{\theta^S} J &= \mathbb{E}_{\pi^S_{old}, \pi^L} \left[ r(\hat{t}, t) \cdot
+\nabla_{\theta^S} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ r(\hat{t}, t) \cdot
 \frac{\nabla_{\theta^S}\pi^S(s_0, s_1 | t)}{\pi^S_{old}(s_0, s_1 | t)} \right] \\
-\nabla_{\theta^L} J &= \mathbb{E}_{\pi^S, \pi^L_{old}} \left[ r(\hat{t}, t) \cdot
+\nabla_{\theta^L} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ r(\hat{t}, t) \cdot
 \frac{\nabla_{\theta^L} \pi^L(\hat{t} | s_0, s_1)}{\pi^L_{old}(\hat{t} | s_0, s_1)} \right]
 \end{align}
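Not part of the commit, the following is a minimal NumPy sketch of the importance-weighted policy-gradient estimator these equations describe, shown for the speaker update with a softmax policy over messages. All names (`speaker_gradient`, `theta`, `theta_old`) are hypothetical, not from the paper's codebase; the listener update would be symmetric.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over a 1-D logit vector
    z = np.exp(logits - logits.max())
    return z / z.sum()

def speaker_gradient(theta, theta_old, action, reward):
    """Single-sample estimate of r * grad_theta pi(a) / pi_old(a),
    i.e. the importance-weighted speaker gradient, for a softmax
    policy over a discrete message set (hypothetical parameterisation)."""
    pi = softmax(theta)          # current policy pi^S
    pi_old = softmax(theta_old)  # behaviour policy pi^S_old that sampled `action`
    one_hot = np.zeros_like(theta)
    one_hot[action] = 1.0
    # gradient of pi[action] w.r.t. theta for a softmax:
    # pi[action] * (one_hot(action) - pi)
    grad_pi_a = pi[action] * (one_hot - pi)
    return reward * grad_pi_a / pi_old[action]
```

When `theta == theta_old` the importance ratio cancels and the estimate reduces to the on-policy REINFORCE gradient `r * (one_hot - pi)`, i.e. `r * grad log pi(a)`.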