In this section, we propose \emph{Mutual Information Similarity (MIS)} as a metric of compositionality and give a thorough theoretical analysis.
MIS is the similarity between an identity matrix and the mutual information matrix of concepts and symbols.
\caption{The information channel modeling of the agents in the referential game.}
\label{fig:modeling}
\end{figure}
Before defining MIS, we first model the agents in the referential game. As shown in Figure~\ref{fig:modeling}, the speaker and the listener are connected in tandem. The speaker agent can be regarded as a channel whose input is an object $t =(c_0, c_1)$, described by two concepts, and whose output is a symbol $s =(s_0, s_1)$. The listener agent can be regarded as another channel whose input is the symbol $s =(s_0, s_1)$ and whose output is a prediction $\hat{t}=(\hat{c}_0, \hat{c}_1)$. Since the output of the listener depends only on the symbol $s$, we can model the policies of the speaker and the listener by the probability distributions $P(s =(s_0, s_1) | t =(c_0, c_1))$ and $P(\hat{t}=(\hat{c}_0, \hat{c}_1) | s_0, s_1)$, respectively.
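As a brief illustration of this tandem model (a sketch that relies only on the assumption stated above, namely that the listener's output depends on $t$ solely through $s$), the end-to-end behaviour of the two channels can be written by marginalizing over the transmitted symbol:
\[
P(\hat{t} \mid t) = \sum_{s} P(\hat{t} \mid s)\, P(s \mid t).
\]
Each summand $P(\hat{t} \mid s)\,P(s \mid t)$ is the probability that, given the object $t$, the speaker emits $s$ and the listener then predicts $\hat{t}$.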
We can now analyse how much information about the concepts is preserved in transmission given the transmitted symbol, i.e., the conditional mutual information $I\left(t,\hat{t}|s\right)$. Once a stable language has emerged, the speaker and the listener consistently use a specific symbol $s$ to refer to a specific object $t$. We therefore take $I\left(t,\hat{t}|s\right)= I\left(t,\hat{t}|s_{t,\hat{t}}\right)$, where $s_{t,\hat{t}}=\arg\max_s\left\{P\left(\hat{t}|s\right)P\left(s|t\right)\right\}$. This conditional mutual information is given by Equation~\ref{eq:cmi}.
We define the ratio of preserved information $R(t, s)$ as in Equation~\ref{eq:ri}, where $H(t)$ denotes the information entropy of $t$. $R(t,s)$ measures the degree of alignment between symbols and objects.
Following Equation~\ref{eq:ri}, we obtain the normalized mutual information matrix $M$ by collecting $R(c_i, s_j)$ for all $i, j$, as in Equation~\ref{eq:mri}.
\begin{equation}\label{eq:mri}
M =
\begin{pmatrix}
R\left(c_0,s_0\right) & R\left(c_0,s_1\right)\\
R\left(c_1,s_0\right) & R\left(c_1,s_1\right)
\end{pmatrix}
\end{equation}
Each column of $M$ corresponds to the semantic information carried by one symbol. In a perfectly compositional language, each symbol represents exactly one concept. Therefore, the similarity between the columns of $M$ and one-hot vectors reflects the compositionality of the emergent language.
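For illustration, consider two limiting cases with two concepts and two symbols. The matrices below are hypothetical and assume that $R$ takes values in $[0,1]$, as its description as a ratio suggests. In the first case, $s_0$ carries information only about $c_0$ and $s_1$ only about $c_1$; in the second, $s_0$ carries all the information about both concepts while $s_1$ carries none:
\[
M_{\mathrm{comp}} =
\begin{pmatrix}
1 & 0\\
0 & 1
\end{pmatrix},
\qquad
M_{\mathrm{ent}} =
\begin{pmatrix}
1 & 0\\
1 & 0
\end{pmatrix}.
\]
Both columns of $M_{\mathrm{comp}}$ are one-hot, whereas the first column of $M_{\mathrm{ent}}$ is $(1,1)^{\top}$ and the second is the zero vector, so neither column of $M_{\mathrm{ent}}$ is one-hot.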
Finally, we define the \emph{raw mutual information similarity} (denoted $S_0$) as the average cosine similarity between the columns of $M$ and one-hot vectors, as in Equation~\ref{eq:mis2}. MIS (denoted $S$) is obtained by normalizing $S_0$ to the range $[0,1]$.
\[
S = \frac{N\cdot S_0 - 1}{N-1}
\]
\label{fig:unilateral}
\end{figure}
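As a worked illustration of $S_0$ and $S$, consider a hypothetical matrix with two concepts and two symbols. This sketch makes three assumptions that are not fixed by the text above: each column is compared with its best-matching one-hot vector, $N=2$ is the number of symbols, and the normalization is $S = (N\cdot S_0 - 1)/(N-1)$.
\[
M =
\begin{pmatrix}
0.8 & 0\\
0.6 & 1
\end{pmatrix}
\]
The first column $(0.8, 0.6)^{\top}$ has unit norm, so its cosine similarity with $(1,0)^{\top}$ is $0.8$; the second column equals $(0,1)^{\top}$, so its cosine similarity is $1$. Under these assumptions,
\[
S_0 = \frac{0.8 + 1}{2} = 0.9,
\qquad
S = \frac{2\cdot 0.9 - 1}{2 - 1} = 0.8 .
\]
The value falls below $1$ because $s_0$ carries information about both concepts.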
MIS is a bilateral metric. Unilateral metrics, e.g., \emph{topographic similarity (topo)}\cite{} and \emph{posdis}\cite{}, take only the policy of the speaker into consideration. Figure~\ref{fig:unilateral} gives an example that illustrates the inadequacy of unilateral metrics. In this example, the speaker uses only $s_1$ to represent shape. From the perspective of the speaker, the language is perfectly compositional (i.e., both topo and posdis are 1). However, the listener cannot determine the shape from $s_1$ alone, which shows that the language is not compositional. The bilateral metric MIS addresses this defect by taking the policy of the listener into account, so $\mathrm{MIS} < 1$.