Commit 8b694c7a by Zidong Du
parents db11311b 153da1e2
@@ -83,7 +83,7 @@ Algorithm~\ref{al:learning}, we train the separate Speaker $S$ and Listener $L$
Stochastic Policy Gradient methodology in a tick-tock manner, i.e., training one
agent while keeping the other one fixed. Roughly, when training the Speaker, the
objective is to maximize the expected reward
-$J(\theta_S, \theta_L)=E_{\pi_S,\pi_L}[R(t, t^)]$ by adjusting the parameter
+$J(\theta_S, \theta_L)=E_{\pi_S,\pi_L}[R(t, \hat{t})]$ by adjusting the parameter
$\theta_S$, where $\theta_S$ are the neural network parameters of Speaker $S$
with learned output probability distribution $\pi_S$, and $\theta_L$ are the
neural network parameters of Listener $L$ with learned probability distribution $\pi_L$.
......
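Below is a minimal sketch of the tick-tock Stochastic Policy Gradient loop described in the hunk above, written as plain REINFORCE in PyTorch. Everything in it is an illustrative assumption rather than the repository's code: the network shapes, the vocabulary and target-set sizes, the alternation period, and the 0/1 reward $R(t, \hat{t})$ that scores whether the Listener's reconstruction $\hat{t}$ matches the Speaker's target $t$.

```python
# Sketch of tick-tock policy-gradient training of a Speaker and a Listener.
# All names and sizes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

N_TARGETS, VOCAB = 10, 10  # assumed sizes of the target set and message vocabulary

class Agent(nn.Module):
    """One-layer policy network mapping an input id to a categorical policy."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.net = nn.Sequential(nn.Embedding(n_in, 32), nn.Linear(32, n_out))
    def forward(self, x):
        return torch.distributions.Categorical(logits=self.net(x))

speaker, listener = Agent(N_TARGETS, VOCAB), Agent(VOCAB, N_TARGETS)
opt_s = torch.optim.Adam(speaker.parameters(), lr=1e-2)
opt_l = torch.optim.Adam(listener.parameters(), lr=1e-2)

def play(t):
    """One game: Speaker emits a symbol for target t, Listener guesses t_hat."""
    pi_s = speaker(t)
    m = pi_s.sample()                 # message ~ pi_S
    pi_l = listener(m)
    t_hat = pi_l.sample()             # reconstruction ~ pi_L
    r = (t_hat == t).float()          # assumed reward R(t, t_hat): 1 if correct
    return r, pi_s.log_prob(m), pi_l.log_prob(t_hat)

for step in range(2000):
    t = torch.randint(N_TARGETS, (32,))     # batch of random targets
    train_speaker = (step // 100) % 2 == 0  # tick-tock: alternate every 100 steps
    r, logp_s, logp_l = play(t)
    # REINFORCE: maximize J = E[R] via the surrogate loss -R * log pi
    if train_speaker:                       # adjust theta_S, hold theta_L fixed
        loss = -(r.detach() * logp_s).mean()
        opt_s.zero_grad(); loss.backward(); opt_s.step()
    else:                                   # adjust theta_L, hold theta_S fixed
        loss = -(r.detach() * logp_l).mean()
        opt_l.zero_grad(); loss.backward(); opt_l.step()
```

Stepping only the active agent's optimizer in each phase is what realizes the tick-tock schedule: the other agent's parameters are simply never updated during that phase.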