haoyifan / AAAI21_Emergent_language / Commits / 9324ac4d

Commit 9324ac4d, authored Sep 09, 2020 by Zidong Du (parent 5401c11a).

Showing 2 changed files with 87 additions and 59 deletions:
  AAAI2021/paper.tex       +2  -35
  AAAI2021/tex/theory.tex  +85 -24
AAAI2021/paper.tex (view file @ 9324ac4d)
@@ -8,6 +8,8 @@
\newcommand{\rmk}[1]{\textcolor{red}{--[#1]--}}
\newcommand{\note}[1]{\textcolor{red}{#1}}
\usepackage{enumitem}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{aaai21}  % DO NOT CHANGE THIS
\usepackage{times}   % DO NOT CHANGE THIS
@@ -224,41 +226,6 @@
\input{tex/experiments.tex}
\input{tex/last.tex}
\begin{algorithm}[!h]
\caption{OurAlgorithm $(t, \hat{t})$}
\begin{algorithmic}[1]
\IF{Training the speaker agent S}
  \FOR{Batch T randomly selected from $M_0 \times M_1$}
    \FOR{$t=(c_0,c_1)$ in T}
      \STATE $P(s_0|t), P(s_1|t) = \pi_{old}^S(s=(s_0,s_1)|t)$
      \STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
      \STATE $P(\hat{t}|s) = \pi^L(\hat{t}|s)$
      \STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
      \STATE Get reward $R(\hat{t},t)$
      \STATE $J(\theta^S,\theta^L) = E_{\pi_{old}^S,\pi^L}[R(\hat{t},t)\cdot\frac{\pi^S(s|t)}{\pi^S_{old}(s|t)}]$
      \STATE Update $\theta^S$ by $\nabla_{\theta^S}J$
    \ENDFOR
    \STATE $\pi_{old}^S \leftarrow \pi^S$
  \ENDFOR
\ENDIF
\IF{Training the listener agent L}
  \FOR{Batch T randomly selected from $M_0 \times M_1$}
    \FOR{$t=(c_0,c_1)$ in T}
      \STATE $P(s_0|t), P(s_1|t) = \pi^S(s=(s_0,s_1)|t)$
      \STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
      \STATE $P(\hat{t}|s) = \pi^L_{old}(\hat{t}|s)$
      \STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
      \STATE Get reward $R(\hat{t},t)$
      \STATE $J(\theta^S,\theta^L) = E_{\pi^S,\pi^L_{old}}[R(\hat{t},t)\cdot\frac{\pi^L(\hat{t}|s)}{\pi^L_{old}(\hat{t}|s)}]$
      \STATE Update $\theta^L$ by $\nabla_{\theta^L}J$
    \ENDFOR
    \STATE $\pi_{old}^L \leftarrow \pi^L$
  \ENDFOR
\ENDIF
\end{algorithmic}
\end{algorithm}
\bibliography{ref.bib}
AAAI2021/tex/theory.tex (view file @ 9324ac4d)
@@ -54,29 +54,90 @@ circle''.
\label{fig:agents}
\end{figure}
The agents apply their own policies to play the referential game. Denote the
policies of the speaker agent S and the listener agent L as $\pi_S$ and
$\pi_L$, respectively. $\pi_S$ gives the conditional probabilities $P(s_0|t)$
and $P(s_1|t)$; $\pi_L$ gives the conditional probability $P(\hat{t}|s_0,s_1)$.
The listener agent outputs the prediction $\hat{t}$ by sampling from the
conditional probability $P(\hat{t}|s_0,s_1)$. Neural networks are used to model
the agent policies; the agent architecture is shown in Figure~\ref{fig:agents}.
For the speaker, the input object $t$ is first passed to an MLP to obtain a
hidden vector $h^S$. The hidden vector is then split into two feature vectors
$h_0^S$ and $h_1^S$, each of length h\_size. Through an MLP and a softmax
layer, these feature vectors are transformed into the outputs $o_0$ and $o_1$,
each of length $|V|$. Finally, the symbol sequences $s_0$ and $s_1$ are sampled
from the outputs $o_0$ and $o_1$.
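As a concrete illustration of the speaker's forward pass described above, here is a minimal PyTorch sketch. It is not the repository's implementation: the module name SpeakerNet, the vector encoding of the input object $t$, and the single hidden layer are assumptions chosen to mirror the description (hidden vector $h^S$ split into $h_0^S$ and $h_1^S$, outputs of length $|V|$, symbols sampled from the outputs).

import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Hypothetical sketch of the speaker: t -> h^S -> (h_0^S, h_1^S) -> (o_0, o_1)."""
    def __init__(self, t_dim, h_size, vocab_size):
        super().__init__()
        self.hidden = nn.Linear(t_dim, 2 * h_size)    # produces h^S, split in two below
        self.out0 = nn.Linear(h_size, vocab_size)     # o_0 over the |V| symbols
        self.out1 = nn.Linear(h_size, vocab_size)     # o_1 over the |V| symbols

    def forward(self, t):
        h = torch.relu(self.hidden(t))                # hidden vector h^S
        h0, h1 = h.chunk(2, dim=-1)                   # split into h_0^S and h_1^S
        p0 = torch.softmax(self.out0(h0), dim=-1)     # P(s_0 | t)
        p1 = torch.softmax(self.out1(h1), dim=-1)     # P(s_1 | t)
        s0 = torch.multinomial(p0, 1)                 # sample symbol s_0
        s1 = torch.multinomial(p1, 1)                 # sample symbol s_1
        return (s0, s1), (p0, p1)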
For the listener, the input symbol sequences $s_0$ and $s_1$ are each passed
through an MLP to obtain the hidden vectors $h_0$ and $h_1$, each of length
h\_size. These vectors are concatenated, and the concatenated vector is passed
through an MLP and a softmax layer to produce the output $o^L$ of length
$|M_0||M_1|$, which denotes $P(\hat{t}|s_0,s_1)$. Finally, the prediction
$\hat{t}$ is sampled from the output $o^L$.
In the experiments, h\_size denotes the model capacity of the agents.
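A matching sketch of the listener, under the same caveats: ListenerNet and the one-hot encoding of the input symbols are assumptions; the structure mirrors the description (per-symbol hidden vectors $h_0$ and $h_1$ of length h\_size, concatenation, output $o^L$ of length $|M_0||M_1|$, prediction sampled from $o^L$).

import torch
import torch.nn as nn

class ListenerNet(nn.Module):
    """Hypothetical sketch of the listener: (s_0, s_1) -> (h_0, h_1) -> concat -> o^L."""
    def __init__(self, vocab_size, h_size, num_meanings):  # num_meanings = |M_0|*|M_1|
        super().__init__()
        self.enc0 = nn.Linear(vocab_size, h_size)           # h_0 from s_0
        self.enc1 = nn.Linear(vocab_size, h_size)           # h_1 from s_1
        self.out = nn.Linear(2 * h_size, num_meanings)      # o^L over the |M_0||M_1| candidates

    def forward(self, s0_onehot, s1_onehot):
        h0 = torch.relu(self.enc0(s0_onehot))                # hidden vector h_0
        h1 = torch.relu(self.enc1(s1_onehot))                # hidden vector h_1
        h = torch.cat([h0, h1], dim=-1)                      # concatenated hidden vector
        p = torch.softmax(self.out(h), dim=-1)               # P(t_hat | s_0, s_1)
        t_hat = torch.multinomial(p, 1)                      # sample prediction t_hat
        return t_hat, p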
\subsection{Training algorithm}
Figure~\ref{fig:agents} shows the architecture of the constructed agents,
including the Speaker $S$ and Listener $L$.
\textbf{Speaker.} The Speaker $S$ is constructed as a three-layer neural
network. It processes the input object $t$ with a fully-connected layer to
obtain the hidden layer $h^S$, which is split into two sub-layers. Each
sub-layer is further processed with a fully-connected layer to obtain the
output layer. The output layer gives the probability distribution of symbols
for the given input object $t$, i.e., $o_i^{S}=P(s_i|t)$, $i\in\{0,1\}$.
\note{The final readout symbols are sampled from this probability
distribution.}
\textbf{Listener.} The Listener $L$ is likewise constructed as a three-layer
neural network. Unlike the Speaker $S$, which splits the hidden layer into two
sub-layers, $L$ concatenates two hidden sub-layers into one output layer. The
output layer gives the probability distribution of the prediction $\hat{t}$
for the given input sequence $s$, i.e., $o^{L}=P(\hat{t}|s_0,s_1)$.
\note{The final readout is sampled from this probability distribution.}
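Since $o^L$ ranges over the $|M_0||M_1|$ candidate objects, a sampled flat index can be mapped back to a concept pair. The snippet below is a hypothetical illustration only; the $M_0$-major ordering of the flattened output is an assumption, not something the text specifies.

# Hypothetical sketch: map a flat prediction index over |M_0|*|M_1| candidates
# back to a concept pair (c_0, c_1). The M_0-major ordering is an assumption.
def decode_prediction(t_hat_index: int, m1_size: int) -> tuple[int, int]:
    c0, c1 = divmod(t_hat_index, m1_size)
    return c0, c1

# Example: with |M_0| = |M_1| = 5, flat index 13 corresponds to (c_0, c_1) = (2, 3).
assert decode_prediction(13, 5) == (2, 3)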
\subsection{Learning algorithm}
\label{ssec:training}
To remove all handcrafted induction, as well as for a more realistic scenario,
the agents in this referential game are independent of each other, sharing
neither model parameters nor architectural connections. As shown in
Algorithm~\ref{al:learning}, we train the separate Speaker $S$ and Listener $L$
with a Stochastic Policy Gradient methodology in a tick-tock manner, i.e.,
training one agent while keeping the other fixed. Roughly, when training the
Speaker, the target is to maximize the expected reward
$J(\theta_S,\theta_L)=E_{\pi_S,\pi_L}[R(\hat{t},t)]$ by adjusting the
parameters $\theta_S$, where $\theta_S$ are the neural network parameters of
Speaker $S$ with learned output probability distribution $\pi_S$, and
$\theta_L$ are the neural network parameters of the Listener with learned
probability distribution $\pi_L$.
Similarly, when training the Listener, the target is to maximize the expected
reward $J(\theta_S,\theta_L)$ by fixing the parameters $\theta_S$ and adjusting
the parameters $\theta_L$.
Additionally, to avoid handcrafted induction on the emergent language, we use
only the prediction $\hat{t}$ of the listener agent as the evidence for whether
a positive reward is given. The gradients of the expected reward
$J(\theta_S,\theta_L)$ can then be calculated as follows:
\begin{align}
\nabla_{\theta^S} J &= \mathbb{E}_{\pi^S,\pi^L}\left[ R(\hat{t}, t) \cdot \nabla_{\theta^S} \log{\pi^S(s_0, s_1 | t)} \right] \\
\nabla_{\theta^L} J &= \mathbb{E}_{\pi^S,\pi^L}\left[ R(\hat{t}, t) \cdot \nabla_{\theta^L} \log{\pi^L(\hat{t} | s_0, s_1)} \right]
\end{align}
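The two gradients above are standard score-function (REINFORCE) estimates. The sketch below shows the corresponding surrogate losses one might minimize with an autograd framework; the names and tensor shapes (p0, p1, p, s0, s1, t_hat, reward) are assumptions that follow the SpeakerNet/ListenerNet sketches earlier, not the repository's code.

import torch

# Hypothetical REINFORCE-style losses matching the two gradients above:
# minimizing -R * log-prob gives the gradient-ascent direction on J.
def speaker_loss(reward, p0, p1, s0, s1):
    # log pi^S(s_0, s_1 | t) = log P(s_0|t) + log P(s_1|t)
    log_pi_s = torch.log(p0.gather(-1, s0)) + torch.log(p1.gather(-1, s1))
    return -(reward * log_pi_s).mean()

def listener_loss(reward, p, t_hat):
    # log pi^L(t_hat | s_0, s_1)
    log_pi_l = torch.log(p.gather(-1, t_hat))
    return -(reward * log_pi_l).mean()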
\begin{algorithm}[t]
\caption{Learning Algorithm $(t, \hat{t})$}
\label{al:learning}
\begin{algorithmic}[1]
\IF{Training the speaker agent S}
  \FOR{Batch T randomly selected from $M_0 \times M_1$}
    \FOR{$t=(c_0,c_1)$ in T}
      \STATE $P(s_0|t), P(s_1|t) = \pi_{old}^S(s=(s_0,s_1)|t)$
      \STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
      \STATE $P(\hat{t}|s) = \pi^L(\hat{t}|s)$
      \STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
      \STATE Get reward $R(\hat{t},t)$
      \STATE $J(\theta^S,\theta^L) = E_{\pi_{old}^S,\pi^L}[R(\hat{t},t)\cdot\frac{\pi^S(s|t)}{\pi^S_{old}(s|t)}]$
      \STATE Update $\theta^S$ by $\nabla_{\theta^S}J$
    \ENDFOR
    \STATE $\pi_{old}^S \leftarrow \pi^S$
  \ENDFOR
\ENDIF
\IF{Training the listener agent L}
  \FOR{Batch T randomly selected from $M_0 \times M_1$}
    \FOR{$t=(c_0,c_1)$ in T}
      \STATE $P(s_0|t), P(s_1|t) = \pi^S(s=(s_0,s_1)|t)$
      \STATE Sample $s_0$ with $P(s_0|t)$, $s_1$ with $P(s_1|t)$
      \STATE $P(\hat{t}|s) = \pi^L_{old}(\hat{t}|s)$
      \STATE Sample $\hat{t}$ with $P(\hat{t}|s)$
      \STATE Get reward $R(\hat{t},t)$
      \STATE $J(\theta^S,\theta^L) = E_{\pi^S,\pi^L_{old}}[R(\hat{t},t)\cdot\frac{\pi^L(\hat{t}|s)}{\pi^L_{old}(\hat{t}|s)}]$
      \STATE Update $\theta^L$ by $\nabla_{\theta^L}J$
    \ENDFOR
    \STATE $\pi_{old}^L \leftarrow \pi^L$
  \ENDFOR
\ENDIF
\end{algorithmic}
\end{algorithm}
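To show how the speaker branch of the learning algorithm above could look in code, here is a hedged sketch of one "tick" of the tick-tock schedule: the Listener is frozen, symbols are sampled from the old speaker policy $\pi^S_{old}$, and the surrogate objective weights the reward by the ratio $\pi^S(s|t)/\pi^S_{old}(s|t)$ before $\pi^S_{old}$ is refreshed. All names (speaker, speaker_old, listener, reward_fn, batches) are placeholders following the earlier sketches, not the repository's API; the listener branch would be symmetric.

import torch
import torch.nn.functional as F

def train_speaker_epoch(speaker, speaker_old, listener, batches, reward_fn, opt, vocab_size):
    """One 'tick' of the tick-tock schedule: update the Speaker with the Listener frozen."""
    for t in batches:                                            # batch T drawn from M_0 x M_1
        with torch.no_grad():
            (s0, s1), (p0_old, p1_old) = speaker_old(t)          # sample s ~ pi_old^S(.|t)
            s0_oh = F.one_hot(s0.squeeze(-1), vocab_size).float()
            s1_oh = F.one_hot(s1.squeeze(-1), vocab_size).float()
            t_hat, _ = listener(s0_oh, s1_oh)                    # sample t_hat ~ pi^L(.|s)
            reward = reward_fn(t_hat, t)                         # R(t_hat, t)
        _, (p0, p1) = speaker(t)                                 # current pi^S probabilities
        ratio = (p0.gather(-1, s0) * p1.gather(-1, s1)) / \
                (p0_old.gather(-1, s0) * p1_old.gather(-1, s1))  # pi^S(s|t) / pi_old^S(s|t)
        loss = -(reward * ratio).mean()                          # ascend the surrogate J
        opt.zero_grad(); loss.backward(); opt.step()             # update theta^S
    speaker_old.load_state_dict(speaker.state_dict())            # pi_old^S <- pi^S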