Update theory.tex

3c299656 · Ruizhi Chen · 2ac2b97f · 3c299656
Commit 3c299656 authored Sep 10, 2020 by Ruizhi Chen

Showing with 2 additions and 2 deletions

AAAI2021/tex/theory.tex
+2 -2

--- a/AAAI2021/tex/theory.tex
+++ b/AAAI2021/tex/theory.tex
@@ -120,9 +120,9 @@ use the predicted result $\hat{t}$ of the listener agent as the
 evidence of whether giving positive rewards. Then, the gradients of the
 expected reward $ J(\theta_S, \theta_L)$ can be calculated as follows:
 \begin{align}
-  \nabla_{\theta^S} J &= \mathbb{E}_{\pi^S_{old}, \pi^L} \left[ r(\hat{t}, t) \cdot
+  \nabla_{\theta^S} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ r(\hat{t}, t) \cdot
     \frac{\nabla_{\theta^S}\pi^S(s_0, s_1 | t)}{\pi^S_{old}(s_0, s_1 | t)} \right] \\
-  \nabla_{\theta^L} J &= \mathbb{E}_{\pi^S, \pi^L_{old}} \left[ r(\hat{t}, t) \cdot
+  \nabla_{\theta^L} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ r(\hat{t}, t) \cdot
    \frac{\nabla_{\theta^L} \pi^L(\hat{t} | s_0, s_1)}{\pi^L_{old}(\hat{t} | s_0, s_1)} \right]
 \end{align}
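
Reviewer note: the updated speaker gradient can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions, not the paper's implementation: both agents are tabular softmax policies, the message pair $(s_0, s_1)$ is collapsed into a single index `m`, the reward is 1 when $\hat{t} = t$ and 0 otherwise, and $\pi^S_{old}$ is taken to equal the current $\pi^S$. All function and variable names are hypothetical. It evaluates the expectation over $\pi^S, \pi^L$ exactly by enumeration and compares the result with a finite-difference gradient of $J$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta_s, theta_l, t):
    """J for a fixed target t: sum_m pi_S(m|t) * pi_L(t|m), with r = 1 iff t_hat == t."""
    pi_s = softmax(theta_s)
    return sum(pi_s[m] * softmax(theta_l[m])[t] for m in range(len(theta_s)))


def speaker_grad_ratio_form(theta_s, theta_l, t):
    """E_{pi_S, pi_L}[ r * grad(pi_S) / pi_S_old ], enumerated exactly.

    Assumption: pi_S_old is a frozen copy of the current pi_S.
    """
    pi_s = softmax(theta_s)
    pi_s_old = list(pi_s)  # treated as a constant, not differentiated
    n_msgs = len(theta_s)
    grad = [0.0] * n_msgs
    for m in range(n_msgs):
        pi_l = softmax(theta_l[m])
        for t_hat in range(len(pi_l)):
            r = 1.0 if t_hat == t else 0.0
            weight = pi_s[m] * pi_l[t_hat]  # probability of (m, t_hat) under pi_S, pi_L
            for k in range(n_msgs):
                # Softmax derivative: d pi_S(m) / d theta_S[k]
                d_pi = pi_s[m] * ((1.0 if m == k else 0.0) - pi_s[k])
                grad[k] += weight * r * d_pi / pi_s_old[m]
    return grad


def finite_difference_grad(theta_s, theta_l, t, eps=1e-6):
    """Central finite-difference gradient of J with respect to the speaker logits."""
    grad = []
    for k in range(len(theta_s)):
        up = list(theta_s)
        down = list(theta_s)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, theta_l, t) - expected_reward(down, theta_l, t)
        grad.append(diff / (2 * eps))
    return grad


if __name__ == "__main__":
    theta_s = [0.3, -1.2, 0.8]                        # speaker logits for one target
    theta_l = [[0.1, 0.5], [1.0, -0.4], [-0.7, 0.2]]  # listener logits per message
    print(speaker_grad_ratio_form(theta_s, theta_l, t=1))
    print(finite_difference_grad(theta_s, theta_l, t=1))
```

In this setting the two printed gradients should agree up to finite-difference error, since dividing by a frozen $\pi^S_{old}$ that equals $\pi^S$ turns the ratio into the usual score-function term. The listener gradient can be checked the same way by swapping the roles of the two policies.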