Prior studies focus on achieving high compositional symbolic language
through \emph{deliberately handcrafted} inductions, e.g., small vocabulary
sizes~\cite{}, memoryless~\cite{}, additional rewards~\cite{}, constructed loss functions~\cite{}, and
ease-of-teaching~\cite{}. \note{Such optimization methodologies are driven by the difficulty of generating a high compositional symbolic language without any induction in an existing multi-agent environment.}
Figure~\ref{fig:induction} reports the compositionality when training two agents
in the widely used listener-speaker referential game to emerge 100 symbolic
languages, and it can be observed that \note{the compositionality
...
...
can generate a higher compositional symbolic language with a higher probability.
In this paper, we made the following contributions:
\begin{itemize}[topsep=0pt,itemsep=0cm]
\item To the best of our knowledge, this is the first work to successfully achieve
Previous works focus on the external environmental factors that impact the
...
...
For example, ~\citet{kirby2015compression} explored how the pressures for expressivity and compressibility affect the structure of the emergent language.
~\citet{li2019ease} studied how the ease-of-teaching pressure impacts iterated language learning in a population regime.
~\citet{evtimova2018emergent} designed a novel multi-modal scenario, in which the speaker and the listener access different modalities of the input object, to explore language emergence.
Such factors are deliberately designed and too idealized to hold in
the real world. None of these works realizes the importance of the capacity of
the agent model itself.
In this paper, all of the handcrafted inductions above are removed, and the emergence of high compositional language is driven only by the agent capacity.
\caption{The architecture of agents. \emph{Left:} speaker. \emph{Right:} listener.}
\label{fig:agents}
\end{figure*}
\section{Symbolic Language Producing}
\label{sec:thory}
Before going into the details of the training algorithms, we first introduce the environment, game rules, and agent architecture that enable the emergence of symbolic language.
\begin{algorithm}[t]
\caption{Learning Algorithm$(t,\hat{t})$}
\label{al:learning}
\small
\begin{algorithmic}[1]
\IF{training the speaker agent $S$}
\FOR{batch $T$ randomly selected from $M_0\times M_1$}
\STATE Update $\theta^S$ by $\nabla_{\theta^S}J$
\ENDFOR
\STATE $\pi_{old}^S\leftarrow\pi^S$
\ENDIF
\end{algorithmic}
\end{algorithm}
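The alternating policy-gradient update in Algorithm~\ref{al:learning} can be sketched as a minimal REINFORCE loop. The tabular speaker policy, the identity-mapping stand-in reward, and all sizes and hyperparameters below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy referential game: the speaker maps a concept index to one symbol
# via a tabular softmax policy (sizes are assumptions for illustration).
N_CONCEPTS, N_SYMBOLS = 9, 9
theta = np.zeros((N_CONCEPTS, N_SYMBOLS))  # speaker logits, theta^S


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def reinforce_step(theta, lr=0.5, batch=64):
    """One inner-loop pass: sample a batch of concepts, emit symbols,
    reward correct mappings, and ascend the policy gradient."""
    for _ in range(batch):
        t = rng.integers(N_CONCEPTS)      # concept drawn from M0 x M1
        pi = softmax(theta[t])
        s = rng.choice(N_SYMBOLS, p=pi)   # speaker samples a symbol
        r = 1.0 if s == t else 0.0        # stand-in listener reward
        grad = -pi
        grad[s] += 1.0                    # d log pi(s|t) / d theta[t]
        theta[t] += lr * r * grad         # REINFORCE update
    return theta


for _ in range(200):                      # outer loop over batches
    theta = reinforce_step(theta)

# After training, the greedy policy should map each concept to itself.
accuracy = np.mean([softmax(theta[t]).argmax() == t for t in range(N_CONCEPTS)])
```

With only a positive reward for correct symbols, the tabular policy quickly concentrates its probability mass, so the greedy mapping becomes near-perfect on this toy task.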
\subsection{Environment setup}
\label{ssec:env}
...
...
Please note that since $t$ and $\hat{t}$ have different lengths, we say
$t=\hat{t}$ if $t$ expresses the same meaning as $\hat{t}$, e.g., ``red circle''.
\caption{An emergent language whose non-compositionality the unilateral metrics cannot measure. Notice that given $s_1=\mathrm{a}$, the listener can determine neither the shape nor the color without knowledge about $s_0$.}
\label{fig:unilateral}
\end{figure}
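A minimal sketch of the meaning-level equality test $t=\hat{t}$ described above, assuming for illustration that a meaning is encoded as a set of whitespace-separated concept words (the encoding is an assumption, not the paper's):

```python
def same_meaning(t, t_hat):
    """t and t_hat may differ in surface length or order; treat them as
    equal when they express the same set of concepts."""
    return set(t.split()) == set(t_hat.split())


a = same_meaning("red circle", "circle red")   # same concepts -> True
b = same_meaning("red circle", "red square")   # different concepts -> False
```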
Before giving the definition of MIS, we first model the agents in the referential games. As shown in Figure~\ref{fig:modeling}, the listener and speaker in the referential game are connected in tandem. The speaker agent can be regarded as a channel whose input is a concept $t=(c_0, c_1)$ and whose output is a symbol $s=(s_0, s_1)$. The listener agent can be regarded as another channel whose input is a symbol $s=(s_0, s_1)$ and whose output is a prediction $\hat{t}=(\hat{c}_0, \hat{c}_1)$. Since the output of the listener depends only on the symbol $s$, we can model the policies of the speaker agent and the listener agent by the probability distributions $P(s=(s_0, s_1) \mid t=(c_0, c_1))$ and $P(\hat{t}=(\hat{c}_0, \hat{c}_1) \mid s_0, s_1)$, respectively.
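Under this tandem-channel view, the end-to-end behavior is the composition $P(\hat{t}\mid t)=\sum_s P(\hat{t}\mid s)P(s\mid t)$, i.e., a product of the two channel matrices. The numbers below are made-up illustrative distributions, not learned policies:

```python
import numpy as np

# Rows index inputs, columns index outputs; values are illustrative.
P_s_given_t = np.array([[0.9, 0.1],       # speaker channel  P(s | t)
                        [0.2, 0.8]])
P_that_given_s = np.array([[0.95, 0.05],  # listener channel P(t_hat | s)
                           [0.10, 0.90]])

# Composing the two channels gives the end-to-end P(t_hat | t).
P_that_given_t = P_s_given_t @ P_that_given_s
```

Each row of the composed matrix is still a valid conditional distribution, which is a quick sanity check on the channel model.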
Now we can analyse the information about the concepts preserved in the transmission process given the symbol transmitted, i.e., the conditional mutual information $I\left(t,\hat{t}\,|\,s\right)$. Whenever a stable language has emerged, the speaker and the listener consistently use a specific symbol $s$ to refer to a specific object $t$. Therefore, we can safely say $I\left(t,\hat{t}\,|\,s\right)= I\left(t,\hat{t}\,|\,s_{t,\hat{t}}\right)$, where $s_{t,\hat{t}}=\arg\max_s\left\{P\left(\hat{t}\,|\,s\right)P\left(s\,|\,t\right)\right\}$. This conditional mutual information can be obtained by Equation~\ref{eq:cmi}.
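The conditional mutual information follows directly from its standard definition; the sketch below evaluates $I(t,\hat{t}\,|\,s)$ on assumed toy joint distributions (the array indexing convention is ours, for illustration only):

```python
import numpy as np

def conditional_mutual_information(p):
    """I(t; t_hat | s) in bits, for a joint array p indexed as [t, t_hat, s]."""
    total = 0.0
    p_s = p.sum(axis=(0, 1))                    # P(s)
    for s in range(p.shape[2]):
        if p_s[s] == 0:
            continue
        joint = p[:, :, s] / p_s[s]             # P(t, t_hat | s)
        pt = joint.sum(axis=1, keepdims=True)   # P(t | s)
        pth = joint.sum(axis=0, keepdims=True)  # P(t_hat | s)
        mask = joint > 0
        total += p_s[s] * np.sum(
            joint[mask] * np.log2(joint[mask] / (pt @ pth)[mask]))
    return total


# A single symbol with t_hat a deterministic copy of t over two equiprobable
# objects: 1 bit is shared between t and t_hat beyond what s carries.
p_copy = np.zeros((2, 2, 1))
p_copy[0, 0, 0] = p_copy[1, 1, 0] = 0.5

# When the symbol fully determines both t and t_hat, nothing is left: I = 0.
p_det = np.zeros((2, 2, 2))
p_det[0, 0, 0] = p_det[1, 1, 1] = 0.5
```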
Each column of $M$ corresponds to the semantic information carried by one symbol. In a perfectly compositional language, each symbol exclusively represents one specific concept. Therefore, the similarity between each column of $M$ and a one-hot vector aligns with the compositionality of the emergent language.
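One simple way to score this column-wise one-hot alignment (an illustrative stand-in, not necessarily the paper's exact MIS formula) is the maximum entry of each normalized column of $M$:

```python
import numpy as np

def column_onehot_score(M):
    """Mean over symbols of how concentrated each column of M is on a
    single concept; 1.0 iff every column is proportional to a one-hot."""
    cols = M / M.sum(axis=0, keepdims=True)  # normalize each column
    return cols.max(axis=0).mean()


# Perfectly compositional: each symbol carries exactly one concept.
M_comp = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
# Entangled: each symbol carries information about both concepts.
M_mixed = np.array([[0.5, 0.5],
                    [0.5, 0.5]])
```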