Introduction
Author of the paper: Yedid Hoshen, Facebook AI Research, NYC
- The attention mechanism models the locality of interactions and improves performance by determining which agents will share information.
- Can be thought of as CommNet with attention or as factorized Interaction Networks.
- Can model high-order interactions with linear complexity in the number of vertices while preserving the structure of the problem.
- Tested on two non-physical tasks (chess and soccer) and a physical task (bouncing balls).
- Paper
Model Architecture
Derivation
Starts from the equations of Interaction Networks and CommNet and modifies them to include an attentional component.
Interaction Networks: Model each interaction by a neural network. Restricting to 2nd-order interactions, let \(\psi_{int}(x_i, x_j)\) be the interaction between agents i and j, and \(\phi(x_i)\) the non-interacting features of agent i. The output \(o_i\) is given by a function \(\theta()\):
\[o_i = \theta\left(\sum_{j \neq i} \psi_{int}(x_i, x_j), \phi(x_i)\right)\]
Complexity: \(O(N^2)\) evaluations of \(\psi_{int}\).
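To make the quadratic cost concrete, here is a minimal NumPy sketch of this aggregation; `psi_int`, `phi`, and `theta` are hypothetical placeholder callables standing in for the learned networks, not the paper's (unreleased) code:

```python
import numpy as np

def interaction_network_step(X, psi_int, phi, theta):
    """X: (N, d) agent features. Requires O(N^2) evaluations of psi_int."""
    N = X.shape[0]
    outputs = []
    for i in range(N):
        # Sum the pairwise interactions of agent i with every other agent j.
        interaction_sum = sum(psi_int(X[i], X[j]) for j in range(N) if j != i)
        outputs.append(theta(interaction_sum, phi(X[i])))
    return np.stack(outputs)
```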
CommNet: Interactions are not modeled explicitly. A communication vector \(\psi_{com}(x_i)\) is computed for each agent:
\[o_i = \theta\left(\sum_{j \neq i} \psi_{com}(x_j), \phi(x_i)\right)\]
Issue: though linear in complexity, too much of the representational burden is placed on \(\theta\).
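For contrast, a sketch of the CommNet aggregation under the same placeholder conventions; note that only N evaluations of `psi_com` are needed:

```python
import numpy as np

def commnet_step(X, psi_com, phi, theta):
    """X: (N, d) agent features. Only N evaluations of psi_com."""
    comms = np.stack([psi_com(x) for x in X])  # (N, k) communication vectors
    total = comms.sum(axis=0)                  # one shared sum over all agents
    outputs = []
    for i in range(len(X)):
        others = total - comms[i]              # exclude agent i's own message
        outputs.append(theta(others, phi(X[i])))
    return np.stack(outputs)
```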
VAIN: Instead of learning an interaction for each pair of agents \(\psi_{int}(x_i, x_j)\), learn a communication vector \(\psi_{vain}^c(x_i)\) along with an attention vector \(a_i = \psi_{vain}^a(x_i)\). The interaction between agents i and j is then modeled by:
\[\psi_{int}(x_i, x_j) \approx e^{-\|a_i - a_j\|^2}\psi_{vain}^c(x_j)\]
The output is then given by:
\[o_i = \theta\left(\sum_{j \neq i} e^{-\|a_i - a_j\|^2}\psi_{vain}^c(x_j), \phi(x_i)\right)\]
In the non-additive case, the attention weights are normalized with a softmax over j, i.e. \(w_{i,j} = \mathrm{softmax}_j\left(-\|a_i - a_j\|^2\right)\).
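A sketch of the VAIN aggregation in its softmax-normalized form; the function names mirror the notation above but are hypothetical placeholders (the paper's code is unreleased). The pairwise attention matrix is still \(N \times N\), but only a linear number of encoder evaluations is needed:

```python
import numpy as np

def vain_step(X, psi_vain_c, psi_vain_a, phi, theta):
    """X: (N, d) agent features. Linear number of encoder evaluations."""
    comms = np.stack([psi_vain_c(x) for x in X])  # (N, k) communication vectors
    attn  = np.stack([psi_vain_a(x) for x in X])  # (N, m) attention vectors
    # Pairwise squared distances between attention vectors: d2[i, j].
    d2 = ((attn[:, None, :] - attn[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    np.fill_diagonal(logits, -np.inf)             # zero weight on self-interaction
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)             # softmax over j
    pooled = w @ comms                            # (N, k) attended communications
    return np.stack([theta(pooled[i], phi(X[i])) for i in range(len(X))])
```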
Benefits: an efficient linear approximation of IN that preserves CommNet's complexity in evaluations of \(\psi()\).
Architecture
- Refer to the figure in the paper for the exact equations.
- Agent features are encoded by
- a singleton encoder to generate a feature encoding
- a communication encoder to generate communication vector and attention vector.
- For each agent, an attention-weighted vector is generated as the weighted sum of the communication vectors of all agents, with the weights for self-interactions set to zero.
- The feature encoding is concatenated with the attended vector from the step above to yield an intermediate feature vector.
- Finally, a decoder yields a per-agent vector. For regression this vector is the final output, while for classification the decoder emits a scalar per agent that can be passed through a softmax (see the sketch below).
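Putting the pieces together, a hedged PyTorch sketch of the full pipeline; the layer shapes, two-layer MLPs, and joint communication/attention head are assumptions, not the paper's exact architecture (the official implementation was not released):

```python
import torch
import torch.nn as nn

class VAIN(nn.Module):
    def __init__(self, d_in, d_enc, d_comm, d_attn, d_out):
        super().__init__()
        # Singleton encoder: per-agent feature encoding.
        self.singleton_enc = nn.Sequential(nn.Linear(d_in, d_enc), nn.ReLU())
        # Communication encoder: emits communication and attention vectors jointly.
        self.comm_enc = nn.Linear(d_in, d_comm + d_attn)
        self.d_comm = d_comm
        # Decoder: maps concatenated features to the per-agent output.
        self.decoder = nn.Sequential(nn.Linear(d_enc + d_comm, d_enc),
                                     nn.ReLU(), nn.Linear(d_enc, d_out))

    def forward(self, X):                          # X: (N, d_in)
        feats = self.singleton_enc(X)              # (N, d_enc) feature encodings
        ca = self.comm_enc(X)
        comms, attn = ca[:, :self.d_comm], ca[:, self.d_comm:]
        d2 = torch.cdist(attn, attn) ** 2          # (N, N) squared attn distances
        mask = torch.eye(X.shape[0], dtype=torch.bool, device=X.device)
        logits = (-d2).masked_fill(mask, float('-inf'))  # no self-interaction
        w = torch.softmax(logits, dim=1)           # attention weights over j
        pooled = w @ comms                         # attention-weighted communications
        return self.decoder(torch.cat([feats, pooled], dim=1))  # per-agent output
```

For a classification task such as the chess experiment, one would set `d_out = 1` (sizes here are illustrative) and pass the per-agent scalars through a softmax across agents, matching the decoder description above.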
Experiments
- In soccer, the nearest neighbors receive the most attention, while the rest of the players receive roughly equal attention. The goalkeeper, when far away, receives almost no attention.
- In bouncing balls, the balls near the target ball receive strong attention. A ball on a collision course with the target ball receives even stronger attention than the nearest neighbor.
- Outperforms CommNet and IN in accuracy on the next-moving-piece prediction experiments in chess.
Notes
- Basically a CommNet with attention-weighted communication vectors; it tries to capture which communications are more important.
- In systems with sparse interactions, the attention mechanism highlights the significantly interacting agents; CommNet fails in this case.
- In the mean-field case, where the important interactions combine additively, IN will fail and CommNet will work, but VAIN can find proper attention weights and improve on CommNet.
- Less suitable for cases where interactions are not sparse and the K most important interactions do not give a good representation, or where interactions are strong and highly non-linear (so a mean-field approximation is non-trivial).
- Code hasn’t been released yet.