In this section, a referential game platform and a speaker-listener model are introduced. The referential game is commonly used in emergent language studies, such as [][]. In this game, the speaker needs to communicate with the listener to complete a task cooperatively. The setup of the referential game is described first. Then, the construction of the speaker and listener agents with neural networks is introduced. Lastly, the training algorithm and the evaluation methods are discussed.
#subsection1: Setup
In the referential game, the agents should obey the following rules:
a) The speaker agent S takes the input object $t$ and outputs the corresponding symbol sequence $s$;
b) The listener agent L takes the symbol sequence $s$ and outputs the predicted result $\hat{t}$;
c) If $t=\hat{t}$, the game succeeds and each agent receives the reward $R(t,\hat{t})=1$; otherwise, the game fails and the reward is set to $R(t,\hat{t})=-1$.
An input object $t$ is a concept sequence with fixed length, denoted $t=(c_0,c_1)$. The concepts $c_0$ (shape) and $c_1$ (color) are drawn from the concept sets $M_0$ and $M_1$ respectively, and each is represented as a one-hot vector. The length of each one-hot vector, i.e., $|M_0|$ or $|M_1|$, ranges from 3 to 6. These two vectors are concatenated to denote the input object $t$.
Each symbol sequence $s$ contains two words, denoted $(s_0,s_1)$. Each word $s_i$ is chosen from the vocabulary set $V$. In this game, the cardinality $|V|$ ranges from 4 to 10, and the inequality $|V|^2\geq|M_0||M_1|$ is satisfied to ensure that the symbol sequence $(s_0,s_1)$ can denote every input object $t$. A one-hot vector of length $|V|$ is used to indicate the words $s_0$ and $s_1$ respectively. Then, the two one-hot vectors are concatenated to denote the symbol sequence $s$.
The predicted result $\hat{t}$ is denoted as a one-hot vector of length $|M_0||M_1|$, where each bit corresponds to one input object. If the predicted result satisfies $\hat{t}[i\cdot|M_1|+j]=1$, the one-hot vectors of the predicted concepts $\hat{c}_0$ and $\hat{c}_1$ satisfy $\hat{c}_0[i]=1$ and $\hat{c}_1[j]=1$ respectively.
If $(c_0,c_1)$ is equal to $(\hat{c}_0,\hat{c}_1)$, the input object and the predicted result indicate the same object.
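To make the encodings concrete, the following sketch shows one way to construct the vectors described above. The concept and vocabulary sizes are assumptions chosen from the stated ranges, and the helper names (`encode_object`, `encode_symbols`, `encode_prediction`, `reward`) are illustrative rather than part of the original platform.

```python
import numpy as np

# Illustrative sizes within the stated ranges: |M_0| = 3 shapes, |M_1| = 4 colors, |V| = 4 words.
M0, M1, V = 3, 4, 4
assert V ** 2 >= M0 * M1  # the vocabulary must be able to name every object

def one_hot(index, length):
    v = np.zeros(length)
    v[index] = 1.0
    return v

# Input object t = (c_0, c_1): concatenation of two one-hot concept vectors.
def encode_object(c0, c1):
    return np.concatenate([one_hot(c0, M0), one_hot(c1, M1)])

# Symbol sequence s = (s_0, s_1): concatenation of two one-hot word vectors.
def encode_symbols(s0, s1):
    return np.concatenate([one_hot(s0, V), one_hot(s1, V)])

# Predicted result t_hat: a single one-hot vector of length |M_0||M_1|,
# where position i*|M_1| + j stands for the object (c_0 = i, c_1 = j).
def encode_prediction(i, j):
    return one_hot(i * M1 + j, M0 * M1)

# Reward: +1 if the prediction names the input object, -1 otherwise.
def reward(c0, c1, t_hat):
    return 1.0 if t_hat[c0 * M1 + c1] == 1.0 else -1.0
```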
#subsection2: Agent Architecture
The agents apply their own policies to play the referential game. Denote the policies of the speaker agent S and the listener agent L as $\pi^S$ and $\pi^L$ respectively. $\pi^S$ indicates the conditional probabilities $P(s_0|t)$ and $P(s_1|t)$, and $\pi^L$ indicates the conditional probability $P(\hat{t}|s_0,s_1)$. The listener agent outputs the predicted result $\hat{t}$ by randomly sampling from the conditional probability $P(\hat{t}|s_0,s_1)$. Neural networks are used to model the agent policies. The agent architecture is shown in Figure 1.
For the speaker, the input object $t$ is first passed to an MLP to obtain a hidden-layer vector $h^S$. Then, the hidden-layer vector is split into two feature vectors $h_0^S$ and $h_1^S$, each of length $h\_size$. Through an MLP and a softmax layer, these feature vectors are transformed into the outputs $o_0$ and $o_1$, each of length $|V|$. Lastly, the symbols $s_0$ and $s_1$ are sampled from the outputs $o_0$ and $o_1$ respectively.
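A minimal sketch of such a speaker is given below, assuming PyTorch. The single-hidden-layer MLPs, the ReLU activations, and all class and parameter names are assumptions; the text only fixes the input and output shapes and the split into $h_0^S$ and $h_1^S$.

```python
import torch
import torch.nn as nn

class Speaker(nn.Module):
    """Sketch of the speaker: input object t -> hidden vector h^S -> two words."""
    def __init__(self, m0, m1, vocab, h_size):
        super().__init__()
        # MLP producing the hidden vector h^S, which is split into h_0^S and h_1^S.
        self.encoder = nn.Sequential(nn.Linear(m0 + m1, 2 * h_size), nn.ReLU())
        # One output head per word, each followed by a softmax of size |V|.
        self.head0 = nn.Linear(h_size, vocab)
        self.head1 = nn.Linear(h_size, vocab)

    def forward(self, t):                        # t: (batch, |M_0|+|M_1|)
        h = self.encoder(t)                      # h^S, length 2 * h_size
        h0, h1 = torch.chunk(h, 2, dim=-1)       # h_0^S, h_1^S
        o0 = torch.softmax(self.head0(h0), -1)   # o_0 = P(s_0 | t)
        o1 = torch.softmax(self.head1(h1), -1)   # o_1 = P(s_1 | t)
        s0 = torch.multinomial(o0, 1)            # sample word s_0
        s1 = torch.multinomial(o1, 1)            # sample word s_1
        return (s0, s1), (o0, o1)
```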
For the listener, the input symbols $s_0$ and $s_1$ are each passed into an MLP to obtain the hidden-layer vectors $h_0$ and $h_1$, each of length $h\_size$. These vectors are concatenated, and the conjunctive vector is passed into an MLP and a softmax layer; the output $o^L$, of length $|M_0||M_1|$, denotes $P(\hat{t}|s_0,s_1)$. Lastly, the predicted result is sampled from the output $o^L$.
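A matching sketch of the listener, under the same PyTorch assumptions as the speaker sketch above:

```python
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Sketch of the listener: two words -> prediction over all |M_0||M_1| objects."""
    def __init__(self, m0, m1, vocab, h_size):
        super().__init__()
        # One MLP per word, each producing a hidden vector of length h_size.
        self.embed0 = nn.Sequential(nn.Linear(vocab, h_size), nn.ReLU())
        self.embed1 = nn.Sequential(nn.Linear(vocab, h_size), nn.ReLU())
        # Final MLP + softmax over the |M_0||M_1| candidate objects.
        self.head = nn.Linear(2 * h_size, m0 * m1)

    def forward(self, s0_onehot, s1_onehot):     # one-hot words, (batch, |V|) each
        h0 = self.embed0(s0_onehot)
        h1 = self.embed1(s1_onehot)
        o = torch.softmax(self.head(torch.cat([h0, h1], dim=-1)), -1)  # o^L = P(t_hat | s)
        t_hat = torch.multinomial(o, 1)          # sample the predicted object
        return t_hat, o
```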
In the experiments, the symbol $h\_size$ is used to denote the model capacity of the agents.
#subsection3: Training Algorithm
In this paper, stochastic policy gradient methods are used to train the speaker and the listener respectively. The symbols $\theta^S$ and $\theta^L$ denote the neural-network parameters of the policies $\pi^S$ and $\pi^L$ respectively. When training the speaker, the parameters $\theta^L$ are fixed, and the training objective is to maximize the expected reward $J(\theta^S,\theta^L)=\mathbb{E}_{\pi^S,\pi^L}[R(t,\hat{t})]$ by adjusting the parameters $\theta^S$. In a similar way, the listener is trained to maximize the expected reward $J(\theta^S,\theta^L)$ by fixing the parameters $\theta^S$ and adjusting the parameters $\theta^L$. To minimize the influence of artificial induction on the emergent language, we only use the predicted result $\hat{t}$ of the listener agent as the evidence for whether to give a positive reward. Then, the gradients of the expected reward $J(\theta^S,\theta^L)$ can be calculated as follows:
\begin{align}
\nabla_{\theta^S} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ R(\hat{t}, t) \cdot \nabla_{\theta^S} \log{\pi^S(s_0, s_1 | t)} \right] \\
\nabla_{\theta^L} J &= \mathbb{E}_{\pi^S, \pi^L} \left[ R(\hat{t}, t) \cdot \nabla_{\theta^L} \log{\pi^L(\hat{t} | s_0, s_1)} \right]
\end{align}
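In practice, these gradients can be estimated from sampled episodes by minimizing a surrogate loss. The sketch below is a REINFORCE-style estimate under the same PyTorch assumptions as the model sketches above; `o0`, `o1`, and `o_listener` stand for the softmax outputs of those sketches, and the function names are illustrative.

```python
import torch

# REINFORCE-style surrogate losses whose gradients match the two gradients above.
# o0, o1, o_listener: batched softmax outputs from the sketches above;
# s0, s1, t_hat: sampled indices of shape (batch, 1); r: rewards of shape (batch, 1).
def speaker_loss(o0, o1, s0, s1, r):
    # log pi^S(s_0, s_1 | t) = log P(s_0 | t) + log P(s_1 | t)
    log_prob = torch.log(o0.gather(-1, s0)) + torch.log(o1.gather(-1, s1))
    return -(r * log_prob).mean()   # minimizing this ascends J w.r.t. theta^S

def listener_loss(o_listener, t_hat, r):
    # log pi^L(t_hat | s_0, s_1)
    log_prob = torch.log(o_listener.gather(-1, t_hat))
    return -(r * log_prob).mean()   # minimizing this ascends J w.r.t. theta^L
```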
Unlike previous studies [][], the agents in this paper are completely independent: none of the neural-network parameters are shared between the agents, and there is no connection between the architectures of their neural networks. The training procedure is shown in Figure 2. The training process alternates between two procedures: speaker training and listener training. While one agent is being trained, the parameters of the other agent are fixed.
\begin{algorithm}[!h]
\caption{OurAlgorithm$(t,\hat{t})$}
\begin{algorithmic}[1]
\IF{Training the speaker agent S}
\FOR{Batch T randomly selected from $M_0\times M_1$}
\FOR{$t=(c_0,c_1)$ in T}
\STATE $P(s_0|t),P(s_1|t)=\pi_{old}^S(s=(s_0,s_1)|t)$
\STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
\STATE $P(\hat{t}|s) = \pi^L(\hat{t}|s)$
\STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
\STATE Get reward $R(\hat{t},t)$
\STATE $J(\theta^S,\theta^L)=E_{\pi_{old}^S,\pi^L}[R(\hat{t},t)\cdot\frac{\pi^S(s|t)}{\pi^S_{old}(s|t)}]$
\STATE Update $\theta^S$ by $\nabla_{\theta^S}J$
\ENDFOR
\STATE $\pi_{old}^S\leftarrow \pi^S$
\ENDFOR
\ENDIF
\IF{Training the listener agent L}
\FOR{Batch T randomly selected from $M_0\times M_1$}
\FOR{$t=(c_0,c_1)$ in T}
\STATE $P(s_0|t),P(s_1|t)=\pi^S(s=(s_0,s_1)|t)$
\STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
\STATE $P(\hat{t}|s) = \pi^L_{old}(\hat{t}|s)$
\STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
\STATE Get reward $R(\hat{t},t)$
\STATE $J(\theta^S,\theta^L)=E_{\pi^S,\pi^L_{old}}[R(\hat{t},t)\cdot\frac{\pi^L(\hat{t}|s)}{\pi^L_{old}(\hat{t}|s)}]$
\STATE Update $\theta^L$ by $\nabla_{\theta^L}J$
\ENDFOR
\STATE $\pi_{old}^L\leftarrow \pi^L$
\ENDFOR
\ENDIF
\end{algorithmic}
\end{algorithm}
Figure 2. Training algorithm of the agents
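The speaker-training branch of this procedure could look as follows in code. This is a sketch under the same PyTorch assumptions as before (the listener branch is symmetric); `speaker_old`, `target_idx`, and the optimizer handling are illustrative details not fixed by the paper.

```python
import torch

def train_speaker_step(speaker, speaker_old, listener, optimizer, t, target_idx):
    """One speaker update from Figure 2 (listener training is symmetric).

    t: batch of encoded objects, shape (batch, |M_0|+|M_1|);
    target_idx: indices i*|M_1|+j of the true objects, shape (batch,);
    speaker_old provides pi_old^S; its parameters and the listener's stay fixed here.
    """
    with torch.no_grad():
        (s0, s1), (o0_old, o1_old) = speaker_old(t)              # sample s ~ pi_old^S
        vocab = o0_old.shape[-1]
        s0_vec = torch.nn.functional.one_hot(s0.squeeze(-1), vocab).float()
        s1_vec = torch.nn.functional.one_hot(s1.squeeze(-1), vocab).float()
        t_hat, _ = listener(s0_vec, s1_vec)                       # listener prediction
        r = (t_hat == target_idx.unsqueeze(-1)).float() * 2 - 1   # reward +1 / -1
    _, (o0, o1) = speaker(t)                                      # current policy pi^S
    ratio = (o0.gather(-1, s0) * o1.gather(-1, s1)) / \
            (o0_old.gather(-1, s0) * o1_old.gather(-1, s1))
    loss = -(r * ratio).mean()                 # maximize E[R * pi^S / pi_old^S]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    speaker_old.load_state_dict(speaker.state_dict())             # pi_old^S <- pi^S
```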
#subsection4: Evaluation
Our objective is to study the relationship between the agent model capacity and the compositionality of the emergent language, within the range afforded by the need for successful communication. The training process is considered finished when the accuracy of the listener converges to 100\%. For each training process, the agent model is evaluated from two aspects: the model capacity and the compositionality of the emergent language.
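As an illustration of the convergence check, the listener's accuracy can be evaluated exhaustively over all objects in $M_0\times M_1$. The sketch below reuses the helpers from the earlier sketches and decodes greedily with argmax, which is an assumption made only to keep the check deterministic; the paper samples from the output distributions instead.

```python
import itertools
import torch

def communication_accuracy(speaker, listener, m0, m1, vocab):
    """Fraction of objects in M_0 x M_1 that the listener identifies correctly.

    Reuses encode_object from the setup sketch; all names are illustrative.
    """
    correct = 0
    for i, j in itertools.product(range(m0), range(m1)):
        t = torch.tensor(encode_object(i, j), dtype=torch.float32).unsqueeze(0)
        _, (o0, o1) = speaker(t)                                   # speaker's word distributions
        s0 = torch.nn.functional.one_hot(o0.argmax(-1), vocab).float()
        s1 = torch.nn.functional.one_hot(o1.argmax(-1), vocab).float()
        _, o_l = listener(s0, s1)                                  # listener's object distribution
        correct += int(o_l.argmax(-1).item() == i * m1 + j)
    return correct / (m0 * m1)
```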