Commit 103e85ae by Zidong Du

~

parent ef75968e
@@ -19,7 +19,7 @@ is around 0.8 when $h_{size}\le 20$; MIS significantly decreases to 0.75 when
$h_{size}$ increases from 20 to 40; MIS further reduces to 0.7 when $h_{size}$
increases from 40 to 100.
For different vocabulary sizes, MIS exhibits similar behavior.
This is because symbols in low-compositional languages carry semantic information
about more concepts. As a result, higher capacity is required to characterize the
complex semantic information for a low-compositional language to emerge.
@@ -41,7 +41,7 @@ We further break down our results to investigate the importance of agent capacity
to the compositionality of symbolic language. Figure~\ref{fig:exp2} reports the
ratio of high-compositional symbolic languages among all emerged languages,
Figure~\ref{fig:exp2} (a) and (b) for $MIS>0.99$ and $MIS>0.9$, respectively. It
can be observed that the ratio of high-compositional symbolic languages
decreases drastically as $h_{size}$ increases. In particular, when $h_{size}$
is large enough (e.g., $>40$), high-compositional symbolic languages rarely
emerge in a natural referential game.
@@ -90,19 +90,19 @@ Figure~\ref{fig:bench}.
%\end{figure}
Figure~\ref{fig:exp3} reports the accuracy of Listener, i.e., the ratio of
games in which Listener correctly reconstructs the object spoken by Speaker
($t=\hat{t}$), as it varies with the training iterations under different
agent capacities.
Figure~\ref{fig:exp3} (a) shows that when $h_{size}$ equals 1, the agent capacity is
too low to handle any of the languages. Figure~\ref{fig:exp3} (b) shows that when $h_{size}$
equals 2, the agents can only learn $LA$, whose compositionality (i.e., \emph{MIS})
is the highest of the three languages. Combining these two observations, we can infer
that a language with lower compositionality requires higher agent capacity
(i.e., larger $h_{size}$) for successful communication. Figure~\ref{fig:exp3} (c) to (h) show that
higher agent capacity leads to faster training for all three languages, but the
improvement differs considerably across languages.
Evidently, languages with lower compositionality also require higher agent
capacity to train faster.
%In conclusion, teaching an artificial language with
...
@@ -16,13 +16,13 @@ including the environment setup, agent architecture, and training algorithm.
\subsection{Environment setup}
\label{ssec:env}
Figure~\ref{fig:game} shows the entire environment used in this study,
i.e., a commonly used referential game. Roughly, the referential game requires the speaker and
listener to work cooperatively to accomplish a certain task.
In this paper, the task is xxxx.
\textbf{Game rules} In our referential game, agents obey the following rules
to finish the game in a cooperative manner. In each round, upon receiving an
input object $t$, Speaker $S$ speaks a symbol sequence $s$ to Listener $L$;
Listener $L$ reconstructs the predicted result $\hat{t}$ based on the received
sequence $s$; if $t=\hat{t}$, the agents win the game and receive positive rewards
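To make the protocol concrete, the following minimal sketch (in Python) plays
one round with hypothetical stand-in policies; the two-symbol sequence and
the $+1/0$ reward values are illustrative assumptions, as the text only
specifies ``positive rewards'':
\begin{verbatim}
# One round of the referential game; speaker_policy and listener_policy
# are hypothetical stand-ins for the learned policies pi_S and pi_L.
def play_round(t, speaker_policy, listener_policy):
    s = speaker_policy(t)          # Speaker S utters symbol sequence s = (s0, s1)
    t_hat = listener_policy(s)     # Listener L reconstructs t_hat from s
    reward = 1.0 if t_hat == t else 0.0   # agents win iff t == t_hat
    return s, t_hat, reward

# Trivial stand-ins: a perfectly compositional "language" over two-digit
# objects, where each symbol encodes one digit of the object.
speaker = lambda t: (t // 10, t % 10)
listener = lambda s: 10 * s[0] + s[1]
print(play_round(42, speaker, listener))   # ((4, 2), 42, 1.0)
\end{verbatim}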
@@ -59,18 +59,14 @@ including the Speaker $S$ and Listener $L$.
\textbf{Speaker.} The Speaker $S$ is constructed as a three-layer neural
network. It processes the input object $t$ with a fully-connected layer to
obtain the hidden layer $h^s$, which is further processed with
fully-connected layers to obtain the output layer. The output layer gives
the probability distribution of symbols for a given input object $t$, i.e.,
$o_i^{s}=P(s_i|t)$, $i\in\{0,1\}$. \note{The final readout symbols are
sampled from this probability distribution.}
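For concreteness, a minimal sketch of such a Speaker is given below, assuming
one-hot object inputs and PyTorch as the framework; the class name, layer
sizes, and activation are illustrative assumptions rather than the exact
architecture used in the experiments.
\begin{verbatim}
import torch
import torch.nn as nn

class Speaker(nn.Module):
    def __init__(self, n_objects, vocab_size, h_size):
        super().__init__()
        self.hidden = nn.Linear(n_objects, h_size)   # input object -> h^s
        # one output head per symbol slot i in {0, 1}: h^s -> o_i^s
        self.heads = nn.ModuleList(
            [nn.Linear(h_size, vocab_size) for _ in range(2)])

    def forward(self, t_onehot):
        h = torch.relu(self.hidden(t_onehot))
        # o_i^s = P(s_i | t), one distribution per symbol slot
        probs = [torch.softmax(head(h), dim=-1) for head in self.heads]
        # the final readout symbols are sampled from these distributions
        symbols = [torch.multinomial(p, num_samples=1) for p in probs]
        return probs, symbols
\end{verbatim}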
\textbf{Listener.} The Listener $L$ is likewise constructed as a three-layer
neural network. Unlike Speaker $S$, which tries to separate the input object
into words, $L$ tries to concatenate words to understand their combined
meaning. The output layer gives the probability distribution of the predicted
result $\hat{t}$ for a given input sequence $s$, i.e., $o^{L}=P(\hat{t}|s_0,s_1)$.
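A matching sketch of the Listener, under the same illustrative assumptions
(one-hot symbol inputs, a shared per-word embedding layer whose outputs are
concatenated before the output layer), might look as follows:
\begin{verbatim}
import torch
import torch.nn as nn

class Listener(nn.Module):
    def __init__(self, n_objects, vocab_size, h_size):
        super().__init__()
        self.embed = nn.Linear(vocab_size, h_size)    # per-word sub-layer
        self.out = nn.Linear(2 * h_size, n_objects)   # concatenation -> o^L

    def forward(self, s0_onehot, s1_onehot):
        h0 = torch.relu(self.embed(s0_onehot))
        h1 = torch.relu(self.embed(s1_onehot))
        h = torch.cat([h0, h1], dim=-1)   # combine the two words' meanings
        probs = torch.softmax(self.out(h), dim=-1)   # o^L = P(t_hat | s0, s1)
        t_hat = torch.multinomial(probs, num_samples=1)  # sampled readout
        return probs, t_hat
\end{verbatim}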
@@ -79,7 +75,7 @@ symbols $\hat{t}$ with given input sequence $s$, i.e., $o^{L}=P(\hat{t}|s_0,s_1)$
To remove all handcrafted induction as well as for a more realistic
scenario, the agents in this referential game are independent of each other,
without sharing model parameters or architectural connections. As shown in
Algorithm~\ref{al:learning}, we train the separate Speaker $S$ and Listener $L$ with
a stochastic policy gradient methodology in a tick-tock manner, i.e., training one
@@ -90,13 +86,13 @@ $\theta_S$, where $\theta_S$ is the neural network parameters of Speaker $S$
with learned output probability distribution $\pi_S$, and $\theta_L$ is the
neural network parameters of Listener with learned probability distribution $\pi_L$.
Similarly, when training the Listener, the target is to maximize the
expected reward $J(\theta_S, \theta_L)$ by fixing the parameters $\theta_S$ and
adjusting the parameters $\theta_L$.
Additionally, to avoid handcrafted induction on the emergent language, we only
use the predicted result $\hat{t}$ of the listener agent as the
evidence for whether to give positive rewards. Then, the gradients of the
expected reward $J(\theta_S, \theta_L)$ can be calculated as follows:
\begin{align}
\nabla_{\theta^S} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ R(\hat{t}, t) \cdot
\nabla_{\theta^S} \log{\pi^S(s_0, s_1 | t)} \right] \\
...
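Putting the pieces together, one tick-tock update with the REINFORCE-style
gradients above might be sketched as follows; the 0/1 reward, the optimizer
handling, and the helper name train_step are illustrative assumptions, not
the exact procedure of Algorithm~\ref{al:learning}:
\begin{verbatim}
import torch
import torch.nn.functional as F

def train_step(speaker, listener, opt_s, opt_l,
               t_onehot, t_index, train_speaker):
    probs_s, (s0, s1) = speaker(t_onehot)
    vocab_size = probs_s[0].shape[-1]
    # re-encode the sampled symbols as one-hot inputs for the Listener
    s0_oh = F.one_hot(s0.squeeze(-1), vocab_size).float()
    s1_oh = F.one_hot(s1.squeeze(-1), vocab_size).float()
    probs_l, t_hat = listener(s0_oh, s1_oh)

    # the only training signal: R(t_hat, t) from the Listener's prediction
    reward = (t_hat.squeeze(-1) == t_index).float()

    if train_speaker:  # "tick": fix theta_L, adjust theta_S
        log_pi = (torch.log(probs_s[0].gather(-1, s0)) +
                  torch.log(probs_s[1].gather(-1, s1))).squeeze(-1)
        loss = -(reward * log_pi).mean()  # -E[R * log pi_S(s0, s1 | t)]
        opt_s.zero_grad(); loss.backward(); opt_s.step()
    else:              # "tock": fix theta_S, adjust theta_L
        log_pi = torch.log(probs_l.gather(-1, t_hat)).squeeze(-1)
        loss = -(reward * log_pi).mean()  # -E[R * log pi_L(t_hat | s)]
        opt_l.zero_grad(); loss.backward(); opt_l.step()
    return reward.mean().item()
\end{verbatim}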