\subsection{Environment setup}
\label{ssec:env}
Figure~\ref{fig:game} shows the entire environment used in this study,
i.e., a commonly used referential game. Roughly, the referential game requires the speaker and
listener to work cooperatively to accomplish a certain task.
In this paper, the task is xxxx.
\textbf{Game rules.} In our referential game, agents follow the rules below
to finish the game in a cooperative manner. In each round, once it receives an
input object $t$, Speaker $S$ speaks a symbol sequence $s$ to Listener $L$;
Listener $L$ reconstructs the predicted result $\hat{t}$ based on the received
sequence $s$; if $t=\hat{t}$, the agents win this game and receive positive rewards
...
...
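For concreteness, one round of this game can be sketched as follows. This is only an
illustration under our own naming (e.g., \texttt{play\_round}); the reward assigned to a
failed round is not specified above, so we simply use zero here.

\begin{verbatim}
# Sketch of one round of the referential game (illustrative names only).
# speak(t) returns the symbol sequence s = (s_0, s_1) for the input object t;
# listen(s) returns the reconstructed object \hat{t}.
def play_round(speak, listen, t, pos_reward=1.0, fail_reward=0.0):
    s = speak(t)                       # Speaker S emits the symbol sequence s
    t_hat = listen(s)                  # Listener L reconstructs \hat{t} from s
    win = (t_hat == t)                 # the agents win iff t = \hat{t}
    reward = pos_reward if win else fail_reward
    return s, t_hat, reward
\end{verbatim}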
\textbf{Speaker.} The Speaker $S$ is constructed as a three-layer neural
network. It processes the input object $t$ with a fully-connected
layer to obtain the hidden layer $h^s$, which is further processed with fully-connected layers to obtain the output
layer. The output layer gives the probability distribution of symbols
given the input object $t$, i.e., $o_i^{s}=P(s_i|t)$, $i\in\{0,1\}$. \note{The final
readout symbols are sampled from this probability distribution.}
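A possible instantiation of this speaker is sketched below. It is only a sketch under our own
assumptions: a one-hot object encoding, a \texttt{tanh} hidden layer, and one fully-connected
output head per symbol; the class name \texttt{Speaker}, layer sizes, and optimizer-free
interface are illustrative rather than the exact implementation.

\begin{verbatim}
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Speaker(nn.Module):
    """Three-layer speaker: object t -> hidden h^s -> distributions P(s_i | t)."""
    def __init__(self, n_objects, hidden_dim, vocab_size):
        super().__init__()
        self.fc_in = nn.Linear(n_objects, hidden_dim)        # input -> hidden layer h^s
        self.heads = nn.ModuleList(                          # one output head per symbol s_0, s_1
            [nn.Linear(hidden_dim, vocab_size) for _ in range(2)])

    def forward(self, t_onehot):
        h = torch.tanh(self.fc_in(t_onehot))                 # hidden layer h^s
        symbols, log_probs = [], []
        for head in self.heads:
            dist = Categorical(logits=head(h))               # o_i^s = P(s_i | t)
            s_i = dist.sample()                              # readout symbol sampled from o_i^s
            symbols.append(s_i)
            log_probs.append(dist.log_prob(s_i))
        return torch.stack(symbols, dim=-1), torch.stack(log_probs, dim=-1)
\end{verbatim}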
\textbf{Listener.} The Listener $L$ is also constructed as a
three-layer neural network. Different from Speaker $S$, which tries to separate the input object into words, $L$ tries to combine the received words to understand their joint meaning. The output layer gives the probability distribution of
the predicted result $\hat{t}$ given the input sequence $s$, i.e., $o^{L}=P(\hat{t}|s_0,s_1)$.
\note{The final readout symbol is sampled from this probability distribution.}
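Correspondingly, the listener can be sketched as below (reusing the imports from the speaker
sketch). Again this is only an assumed instantiation: the two symbols are looked up in an
embedding table, which is equivalent to a bias-free fully-connected layer applied to one-hot
symbols, then concatenated and mapped to a distribution over candidate objects.

\begin{verbatim}
class Listener(nn.Module):
    """Three-layer listener: symbols (s_0, s_1) -> combined hidden -> P(\hat{t} | s_0, s_1)."""
    def __init__(self, vocab_size, hidden_dim, n_objects):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)       # per-symbol ("word") representation
        self.fc_hidden = nn.Linear(2 * hidden_dim, hidden_dim)  # combine the two word meanings
        self.fc_out = nn.Linear(hidden_dim, n_objects)          # o^L over candidate objects

    def forward(self, symbols):                                 # symbols: integer tensor (..., 2)
        e = self.embed(symbols)                                 # (..., 2, hidden_dim)
        h = torch.tanh(self.fc_hidden(e.flatten(start_dim=-2))) # concatenated word representations
        dist = Categorical(logits=self.fc_out(h))               # o^L = P(\hat{t} | s_0, s_1)
        t_hat = dist.sample()                                   # final readout sampled from o^L
        return t_hat, dist.log_prob(t_hat)
\end{verbatim}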
...
...
To remove all the handcrafted induction as well as to reflect a more realistic
scenario, the agents in this referential game are independent of each other,
without sharing model parameters or architectural connections. As shown in
Algorithm~\ref{al:learning}, we train the separate Speaker $S$ and Listener $L$ with
the Stochastic Policy Gradient methodology in a tick-tock manner, i.e., training one
...
...
$\theta_S$, where $\theta_S$ denotes the neural network parameters of Speaker $S$
with the learned output probability distribution $\pi_S$, and $\theta_L$ denotes the
neural network parameters of Listener $L$ with the learned probability distribution $\pi_L$.
Similarly, when training the Listener, the target is set to maximize the
expected reward $J(\theta_S, \theta_L)$ by fixing the parameter $\theta_S$ and
adjusting the parameter $\theta_L$.
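This alternating scheme can be sketched as a plain REINFORCE-style loop. The sketch reflects
our own simplifications (Adam optimizers, a fixed switching period, reward $1$ for success and
$0$ otherwise); it only illustrates the fixing/adjusting pattern, while the gradient estimator
actually used for $J(\theta_S, \theta_L)$ is the one given below.

\begin{verbatim}
import torch.optim as optim

def train_tick_tock(speaker, listener, sample_object, steps=100000,
                    lr=1e-3, switch_every=50):
    """Alternating (tick-tock) training: only one agent is updated at a time."""
    opt_s = optim.Adam(speaker.parameters(), lr=lr)
    opt_l = optim.Adam(listener.parameters(), lr=lr)
    for step in range(steps):
        update_speaker = (step // switch_every) % 2 == 0   # tick: S, tock: L
        t = sample_object()                                # one-hot input object
        symbols, logp_s = speaker(t)
        t_hat, logp_l = listener(symbols)
        reward = (t_hat == t.argmax(dim=-1)).float()       # positive reward iff t = \hat{t}
        if update_speaker:                                 # fix theta_L, adjust theta_S
            loss = -(reward * logp_s.sum(dim=-1)).mean()
            opt_s.zero_grad(); loss.backward(); opt_s.step()
        else:                                              # fix theta_S, adjust theta_L
            loss = -(reward * logp_l).mean()
            opt_l.zero_grad(); loss.backward(); opt_l.step()
\end{verbatim}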
Additionally, to avoid handcrafted induction on the emergent language, we only
use the predicted result $\hat{t}$ of the listener agent as the
evidence for whether to give positive rewards. Then, the gradients of the
expected reward $J(\theta_S, \theta_L)$ can be calculated as follows: