Commit 8b694c7a by Zidong Du
parents db11311b 153da1e2
@@ -83,7 +83,7 @@ Algorithm~\ref{al:learning}, we train the separate Speaker $S$ and Listener $L$
 Stochastic Policy Gradient methodology in a tick-tock manner, i.e., training one
 agent while keeping the other one fixed. Roughly, when training the Speaker, the
 target is set to maximize the expected reward
-$J(\theta_S, \theta_L)=E_{\pi_S,\pi_L}[R(t, t^)]$ by adjusting the parameters
+$J(\theta_S, \theta_L)=E_{\pi_S,\pi_L}[R(t, \hat{t})]$ by adjusting the parameters
 $\theta_S$, where $\theta_S$ denotes the neural network parameters of Speaker $S$
 with learned output probability distribution $\pi_S$, and $\theta_L$ denotes the
 neural network parameters of Listener $L$ with learned probability distribution $\pi_L$.
......
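For concreteness, the tick-tock policy-gradient loop this hunk describes can be sketched as a REINFORCE-style alternation: on each step only one agent's parameters ($\theta_S$ or $\theta_L$) receive a gradient update toward maximizing $E_{\pi_S,\pi_L}[R(t, \hat{t})]$, while the other agent is held fixed. Everything below is an assumed stand-in, not the authors' implementation: the single-layer agents, the concept/symbol vocabulary sizes, and the 0/1 success reward are hypothetical placeholders for the paper's actual networks and reward $R(t, \hat{t})$.

```python
# Minimal sketch of alternating ("tick-tock") Stochastic Policy Gradient
# training for a Speaker/Listener pair. All sizes, architectures, and the
# reward are assumptions for illustration, not taken from the paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical

N_CONCEPTS, N_SYMBOLS = 10, 10  # hypothetical vocabulary sizes

# Hypothetical single-layer agents standing in for the paper's networks.
speaker = nn.Linear(N_CONCEPTS, N_SYMBOLS)   # concept one-hot -> symbol logits (pi_S)
listener = nn.Linear(N_SYMBOLS, N_CONCEPTS)  # symbol one-hot -> concept logits (pi_L)
opt_S = torch.optim.Adam(speaker.parameters(), lr=1e-2)
opt_L = torch.optim.Adam(listener.parameters(), lr=1e-2)

def one_hot(i, n):
    return torch.eye(n)[i]

for step in range(5000):
    train_speaker = step % 2 == 0            # tick-tock: alternate which agent learns
    t = torch.randint(N_CONCEPTS, (1,))      # target concept t
    sym_dist = Categorical(logits=speaker(one_hot(t, N_CONCEPTS)))
    s = sym_dist.sample()                    # Speaker's symbol ~ pi_S
    rec_dist = Categorical(logits=listener(one_hot(s, N_SYMBOLS)))
    t_hat = rec_dist.sample()                # Listener's reconstruction ~ pi_L
    r = (t_hat == t).float()                 # assumed reward R(t, t_hat): 1 on success
    # REINFORCE: maximize J = E[R] via R * grad log pi of the trained agent only;
    # the other agent's parameters stay fixed this step.
    if train_speaker:
        loss = -(r * sym_dist.log_prob(s)).mean()
        opt_S.zero_grad(); loss.backward(); opt_S.step()
    else:
        loss = -(r * rec_dist.log_prob(t_hat)).mean()
        opt_L.zero_grad(); loss.backward(); opt_L.step()
```

In this sketch the sampled actions block gradient flow between the two agents, which is why the score-function (REINFORCE) estimator is used; holding one agent fixed per step matches the paper's description of training one agent while keeping the other.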