MLOps Blog

The Best Reinforcement Learning Papers from the ICLR 2020 Conference

5 min
4th September, 2023

Last week I had the pleasure of participating in the International Conference on Learning Representations (ICLR), an event dedicated to research on all aspects of representation learning, commonly known as deep learning. The conference went virtual due to the coronavirus pandemic, and thanks to the huge effort of its organizers, the event attracted an even bigger audience than last year. Their goal was for the conference to be inclusive and interactive, and from my point of view as an attendee, it definitely was!

Inspired by the presentations from over 1300 speakers, I decided to create a series of blog posts summarizing the best papers in four main areas. You can catch up with the first post about the best deep learning papers here, and today it's time for the 15 best reinforcement learning papers from ICLR.

The Best Reinforcement Learning Papers

1. Never Give Up: Learning Directed Exploration Strategies

We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.

(TL;DR, from OpenReview.net)

Paper

(left) Training architecture for the embedding network; (right) NGU's reward generator.
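
To give a flavor of the idea, here is a minimal sketch of the reward augmentation behind NGU: the agent maximizes the extrinsic game reward plus a scaled intrinsic novelty bonus, and different scaling coefficients yield the family of directed exploratory policies mentioned in the TL;DR. The embedding distance and the beta coefficient below are illustrative simplifications, not the paper's exact formulation.

import numpy as np

def episodic_novelty_bonus(embedding, episodic_memory, k=10):
    # Toy stand-in for NGU's episodic novelty: the bonus grows with the distance
    # between the current state embedding and the k closest embeddings already
    # visited in this episode (the real agent uses a learned embedding network).
    if not episodic_memory:
        return 1.0
    dists = sorted(float(np.linalg.norm(embedding - m)) for m in episodic_memory)[:k]
    return float(np.mean(dists))

def augmented_reward(extrinsic, intrinsic, beta=0.3):
    # NGU-style reward: r_t = r_t^extrinsic + beta * r_t^intrinsic.
    return extrinsic + beta * intrinsic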

Main authors: 

Adrià Puigdomènech Badia

LinkedIn | GitHub 

Pablo Sprechmann

 Twitter | LinkedIn


2. Program Guided Agent

We propose a modular framework that can accomplish tasks specified by programs and achieve zero-shot generalization to more complex tasks.

(TL;DR, from OpenReview.net)

Paper

An illustration of the proposed problem. We are interested in learning to fulfill tasks specified by written programs. A program consists of control flows (e.g. if, while), branching conditions (e.g. is_there[River]), and subtasks (e.g. mine(Wood)).
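
To make this concrete, here is a toy example of such a task program. The environment interface (is_there, mine, build_bridge, goto) and the crafting logic are assumptions made up for illustration, not the paper's actual DSL or codebase.

class ToyCraftEnv:
    # Minimal mock environment so the program below actually runs.
    def __init__(self):
        self.wood = 0
        self.river_blocked = True

    def is_there(self, thing):
        return thing == "River" and self.river_blocked

    def mine(self, resource):
        self.wood += 1

    def build_bridge(self):
        if self.wood >= 3:
            self.river_blocked = False

    def goto(self, target):
        print(f"moving to {target}")

def example_program(env):
    # while is_there[River]: gather wood, build a bridge, then head to the goal.
    while env.is_there("River"):
        if env.wood >= 3:
            env.build_bridge()
        else:
            env.mine("Wood")
    env.goto("Goal")

example_program(ToyCraftEnv())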

First author: Shao-Hua Sun

Twitter | LinkedIn | GitHub


3. Model Based Reinforcement Learning for Atari

We use video prediction models, a model-based reinforcement learning algorithm and 2h of gameplay per game to train agents for 26 Atari games.

(TL;DR, from OpenReview.net)

Paper | Code

Main loop of SimPLe. 1) The agent starts interacting with the real environment following the latest policy (initialized to random). 2) The collected observations are used to train (update) the current world model. 3) The agent updates the policy by acting inside the world model. The new policy is evaluated to measure the performance of the agent as well as to collect more data (back to 1). Note that world model training is self-supervised for the observed states and supervised for the reward.
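
The loop in the caption can be summarized in a few lines. The sketch below only mirrors that structure; the object interfaces (collect_episodes, fit, train) are illustrative assumptions, not the released SimPLe code.

def simple_style_loop(real_env, world_model, policy, n_iterations=15):
    # Structural sketch of the SimPLe loop described in the caption above.
    data = []
    for _ in range(n_iterations):
        # 1) Interact with the real environment using the current policy.
        data.extend(policy.collect_episodes(real_env, n_episodes=10))
        # 2) Train (update) the world model on all observations collected so far.
        world_model.fit(data)
        # 3) Improve the policy entirely inside the learned world model.
        policy.train(simulated_env=world_model, n_steps=100_000)
    return policy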

Main authors:

Łukasz Kaiser

Twitter | LinkedIn | GitHub

Błażej Osiński

LinkedIn


4. Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents

We generate critical states of a trained RL algorithm to visualize potential weaknesses.

(TL;DR, from OpenReview.net)

Paper

Qualitative Results: Visualization of different target functions (Sec. 2.3). T+ generates high-reward and T− low-reward states; T± generates states in which one action is highly beneficial and another is bad.
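
As a rough illustration of the T+/T− objectives, the snippet below searches for a state that a trained agent's value function rates as unusually good (sign=+1) or bad (sign=-1). It uses plain random search over raw state vectors, whereas the paper optimizes through a learned generator network, so treat it as a conceptual sketch only.

import numpy as np

def critical_state_search(value_fn, state_shape, sign=1.0, steps=200, step_size=0.1, seed=0):
    # Hill-climb towards a state the agent considers extremely good or bad.
    rng = np.random.default_rng(seed)
    state = rng.normal(size=state_shape)
    best = sign * value_fn(state)
    for _ in range(steps):
        candidate = state + step_size * rng.normal(size=state_shape)
        score = sign * value_fn(candidate)
        if score > best:
            state, best = candidate, score
    return state

# Toy usage with a made-up value function that peaks at the origin.
print(critical_state_search(lambda s: -float(np.sum(s ** 2)), state_shape=(4,)))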

First author: Christian Rupprecht

Twitter | LinkedIn | GitHub


5. Meta-Learning without Memorization

We identify and formalize the memorization problem in meta-learning and solve it with a novel meta-regularization method, which greatly expands the domain where meta-learning is applicable and effective.

(TL;DR, from OpenReview.net)

Paper | Code

Left: An example of non-mutually-exclusive pose prediction tasks, which may lead to the memorization problem. The training tasks are non-mutually-exclusive because the test data label (right) can be inferred accurately without using task training data (left) in the training tasks, by memorizing the canonical orientation of the meta-training objects. For a new object and canonical orientation (bottom), the task cannot be solved without using task training data (bottom left) to infer the canonical orientation. Right: Graphical model for meta-learning. Observed variables are shaded. Without either one of the dashed arrows, Ŷ* is conditionally independent of D given θ and X*, which we refer to as complete memorization (Definition 1).
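
The meta-regularization boils down to penalizing how much information the predictor can extract from the test input without going through the task training data; one standard form of such a penalty is a KL divergence between a stochastic representation and a standard normal prior. The helper below computes only that penalty term and is an illustrative sketch, not the paper's full objective.

import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), usable as an information penalty.
    mu, log_var = np.asarray(mu, dtype=float), np.asarray(log_var, dtype=float)
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# Example: a two-dimensional representation with partly shrunken variance.
print(kl_to_standard_normal([1.0, 0.0], [-1.0, 0.0]))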

Main authors:

Mingzhang Yin

Twitter | LinkedIn | GitHub

Chelsea Finn

Twitter | GitHub | Website


6. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

Exponential lower bounds for value-based and policy-based reinforcement learning with function approximation.

(TL;DR, from OpenReview.net)

Paper 

An example with H = 3. For this example, we have r(s5) = 1 and r(s) = 0 for all other states s. The unique state s5 that satisfies r(s) = 1 is marked with a dash in the figure. The induced Q* function is marked on the edges.

First author: Simon S. Du

Twitter | LinkedIn | Website


7. The Ingredients of Real World Robotic Reinforcement Learning

System to learn robotic tasks in the real world with reinforcement learning without instrumentation.

(TL;DR, from OpenReview.net)

Paper 

Illustration of our proposed instrumentation-free system requiring minimal human engineering. Human intervention is only required in the goal collection phase (1). The robot is left to train unattended (2) during the learning phase and can be evaluated from arbitrary initial states at the end of training (3). We show sample goal and intermediate images from the training process of a real hardware system.

First author: Henry Zhu

LinkedIn | Website


8. Improving Generalization in Meta Reinforcement Learning using Learned Objectives

We introduce MetaGenRL, a novel meta reinforcement learning algorithm. Unlike prior work, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training.

(TL;DR, from OpenReview.net)

Paper 

A schematic of MetaGenRL. On the left, a population of agents (i ∈ 1, …, N), where each member consists of a critic Q_θ^(i) and a policy π_φ^(i) that interact with a particular environment e^(i) and store collected data in a corresponding replay buffer B^(i). On the right, a meta-learned neural objective function L_α that is shared across the population. Learning (dotted arrows) proceeds as follows: each policy is updated by differentiating L_α, while the critic is updated using the usual TD-error (not shown). L_α is meta-learned by computing second-order gradients that can be obtained by differentiating through the critic.
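
The schematic translates into a short training skeleton: each agent updates its critic with the usual TD-error and its policy by differentiating the shared learned objective, which is in turn meta-updated through the critics. The method names below are assumptions used purely to show the structure, not MetaGenRL's actual API.

def metagenrl_style_step(population, learned_objective):
    # One outer iteration of the scheme sketched in the caption above.
    for agent in population:
        batch = agent.replay_buffer.sample()
        agent.critic.td_update(batch)                      # standard TD-error update
        agent.policy.update(objective=learned_objective,   # differentiate L_alpha w.r.t. the policy
                            critic=agent.critic,
                            batch=batch)
    # Meta-update of L_alpha via second-order gradients through the critics.
    learned_objective.meta_update(population)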

First author: Louis Kirsch

Twitter | LinkedIn | Website


9. Making Sense of Reinforcement Learning and Probabilistic Inference

Popular algorithms that cast “RL as Inference” ignore the role of uncertainty and exploration. We highlight the importance of these issues and present a coherent framework for RL and inference that handles them gracefully.

(TL;DR, from OpenReview.net)

Paper

Regret scaling on Problem 1. Soft Q-learning does not scale gracefully with N. 
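
For context, the “soft” Q-learning backup discussed in the paper replaces the hard max over next-state action values with a log-sum-exp. A minimal numpy version of that target (the explicit temperature parameter is my addition for clarity) looks like this:

import numpy as np

def soft_q_target(reward, next_q_values, gamma=0.99, temperature=1.0):
    # Soft Bellman backup: a log-sum-exp "soft max" replaces max_a Q(s', a).
    q = np.asarray(next_q_values, dtype=float)
    soft_max = temperature * np.log(np.sum(np.exp(q / temperature)))
    return reward + gamma * soft_max

# With a low temperature this approaches the ordinary max-based target.
print(soft_q_target(1.0, [0.2, 0.7, 0.1], temperature=0.01))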

First author: Brendan O’Donoghue

Twitter | LinkedIn | GitHub


10. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

SEED RL, a scalable and efficient deep reinforcement learning agent with accelerated central inference. It achieves state-of-the-art results, reduces cost, and can process millions of frames per second.

(TL;DR, from OpenReview.net)

Paper | Code

Overview of architectures 
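
The architectural shift is that actors no longer run the policy network themselves; they stream observations to a central learner that performs batched inference on an accelerator and sends actions back. A simplified actor loop under that split might look like the following (the learner_client interface is an assumption, not the SEED RL API).

def seed_style_actor(env, learner_client, max_steps=100_000):
    # The actor only steps the environment; policy inference happens centrally.
    observation = env.reset()
    for _ in range(max_steps):
        action = learner_client.inference(observation)         # remote, batched on the learner
        observation, reward, done, _ = env.step(action)
        learner_client.send_transition(observation, reward, done)
        if done:
            observation = env.reset()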

First author: Lasse Espeholt

LinkedIn | GitHub


11. Multi-agent Reinforcement Learning for Networked System Control

This paper proposes a new formulation and a new communication protocol for networked multi-agent control problems.

(TL;DR, from OpenReview.net)

Paper | Code

Forward propagations of NeurComm-enabled MARL, illustrated in a queueing system. (a) Single-step forward propagations inside agent i. Different colored boxes and arrows show different outputs and functions, respectively. Solid and dashed arrows indicate actor and critic propagations, respectively. (b) Multi-step forward propagations for updating the belief of agent i.
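
A very reduced picture of one communication round in such a networked system: each agent aggregates its neighbors' hidden states before updating its own belief. NeurComm learns what to communicate; the plain neighborhood average below is only an illustrative stand-in.

import numpy as np

def neighbor_communication_step(hidden_states, adjacency):
    # hidden_states: (n_agents, hidden_dim); adjacency: (n_agents, n_agents) 0/1 matrix.
    h = np.asarray(hidden_states, dtype=float)
    a = np.asarray(adjacency, dtype=float)
    degrees = a.sum(axis=1, keepdims=True) + 1e-8
    return (a @ h) / degrees  # each row: average of the neighbors' hidden states

# Toy usage: three agents on a line graph 0 - 1 - 2.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(neighbor_communication_step(h, a))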

First author: Tianshu Chu

Website


12. A Generalized Training Approach for Multiagent Learning

This paper studies and extends Policy-Space Response Oracles (PSRO), a population-based learning method that uses game-theoretic principles. The authors extend the method so that it is applicable to multi-player games, while providing convergence guarantees in multiple settings.

Paper

Overview of PSRO(M, O) algorithm phases.
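
At a high level, PSRO alternates between evaluating an empirical game among the current population of policies and adding an approximate best response against the resulting meta-strategy. The skeleton below only captures that loop; the game, oracle, and meta_solver interfaces are assumptions for illustration.

def psro_style_loop(game, oracle, meta_solver, n_iterations=10):
    # Start each player's population with a single (e.g. random) policy.
    populations = [[game.random_policy(p)] for p in range(game.num_players)]
    meta_strategies = None
    for _ in range(n_iterations):
        payoff_table = game.evaluate(populations)      # empirical game between population members
        meta_strategies = meta_solver(payoff_table)    # e.g. a Nash or alpha-Rank based solver
        for p in range(game.num_players):
            best_response = oracle.train_best_response(p, populations, meta_strategies)
            populations[p].append(best_response)
    return populations, meta_strategies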

First author: Paul Muller

Website


13. Implementation Matters in Deep RL: A Case Study on PPO and TRPO

Sometimes an implementation detail can play a significant role in your research. Here, two policy search algorithms are evaluated: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). “Code-level optimizations” should, in principle, have a negligible effect on the learning dynamics. Surprisingly, it turns out that these optimizations have a major impact on agent behavior.

Paper | Code

An ablation study on the first four optimizations described in Section 3 (value clipping, reward scaling, network initialization, and learning rate annealing). For each of the 2^4 possible configurations of optimizations, we train a Humanoid-v2 (top) and Walker2d-v2 (bottom) agent using PPO with five random seeds and a grid of learning rates, and choose the learning rate which gives the best average reward (averaged over the random seeds). We then consider all rewards from the “best learning rate” runs (a total of 5 × 2^4 agents), and plot histograms in which agents are partitioned based on whether each optimization is on or off. Our results show that reward normalization, Adam annealing, and network initialization each significantly impact the rewards landscape with respect to hyperparameters, and were necessary for attaining the highest PPO reward within the tested hyperparameter grid.
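
Two of the optimizations studied (value clipping and reward scaling) are easy to state in code. The simplified numpy versions below illustrate the idea; common PPO implementations actually normalize rewards by the standard deviation of a rolling discounted return rather than of the raw rewards.

import numpy as np

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    # Value clipping: penalize value predictions that move too far away from the
    # previous iteration's predictions, mirroring PPO's policy-ratio clipping.
    values, old_values, returns = map(np.asarray, (values, old_values, returns))
    clipped = old_values + np.clip(values - old_values, -clip_eps, clip_eps)
    return float(np.mean(np.maximum((values - returns) ** 2, (clipped - returns) ** 2)))

class RunningRewardScaler:
    # Reward scaling: divide each reward by the standard deviation of rewards seen so far.
    def __init__(self):
        self.history = []
    def __call__(self, reward):
        self.history.append(float(reward))
        std = float(np.std(self.history))
        return reward / std if std > 0 else reward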

Main authors:

Logan Engstrom

Twitter | GitHub | Website

Aleksander Madry

Twitter | GitHub | Website 


14. A Closer Look at Deep Policy Gradients

This is an in-depth, empirical study of the behavior of deep policy gradient algorithms. The authors analyse SOTA methods through the lens of gradient estimation, value prediction, and optimization landscapes.

Paper

Empirical variance of the estimated gradient (c.f. (1)) as a function of the number of state-action pairs used in estimation in the MuJoCo Humanoid task. We measure the average pairwise cosine similarity between ten repeated gradient measurements taken from the same policy, with the 95% confidence intervals (shaded). For each algorithm, we perform multiple trials with the same hyperparameter configurations but different random seeds, shown as repeated lines in the figure. The vertical line (at x = 2K) indicates the sample regime used for gradient estimation in standard implementations of policy gradient methods. In general, it seems that obtaining tightly concentrated gradient estimates would require significantly more samples than are used in practice, particularly after the first few timesteps. For other tasks, such as Walker2d-v2 and Hopper-v2, the plots have similar trends, except that gradient variance is slightly lower. Confidence intervals calculated with 500 sample bootstrapping.
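
The concentration measure used in this figure is simple to reproduce: take several independent gradient estimates for the same policy and average their pairwise cosine similarities. A small numpy sketch with synthetic gradients (the noise level is made up) is shown below.

import numpy as np

def mean_pairwise_cosine_similarity(gradients):
    # gradients: (n_estimates, n_parameters); values near 1 mean the estimates agree.
    g = np.asarray(gradients, dtype=float)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = g @ g.T
    upper = np.triu_indices(len(g), k=1)
    return float(sims[upper].mean())

# Ten noisy estimates of the same 1000-dimensional "true" gradient.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)
estimates = true_grad + 5.0 * rng.normal(size=(10, 1000))
print(mean_pairwise_cosine_similarity(estimates))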

Main authors:

Andrew Ilyas

Twitter | GitHub | Website

Aleksander Madry

Twitter | GitHub | Website 


15. Meta-Q-Learning

MQL is a simple off-policy meta-RL algorithm that recycles data from the meta-training replay buffer to adapt to new tasks.

(TL;DR, from OpenReview.net)

Paper 

How well does meta-RL work? Average returns on validation tasks compared for two prototypical meta-RL algorithms, MAML (Finn et al., 2017) and PEARL (Rakelly et al., 2019), with those of a vanilla Q-learning algorithm named TD3 (Fujimoto et al., 2018b) that was modified to incorporate a context variable that is a representation of the trajectory from a task (TD3-context). Even without any meta-training and adaptation on a new task, TD3-context is competitive with these sophisticated algorithms.
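
The context variable mentioned above is a compact summary of the recent trajectory that is fed to the policy and value networks alongside the state. The paper uses a recurrent encoder; the exponentially weighted feature average below is a deliberately simple stand-in to show the shape of the idea.

import numpy as np

def trajectory_context(transitions, decay=0.9):
    # transitions: iterable of (state, action, reward) from the current task.
    context = None
    for state, action, reward in transitions:
        features = np.concatenate([np.asarray(state, dtype=float),
                                   np.asarray(action, dtype=float),
                                   [float(reward)]])
        context = features if context is None else decay * context + (1 - decay) * features
    return context

# Toy usage with two transitions of a 2-d state and 1-d action.
print(trajectory_context([([0.0, 1.0], [0.5], 1.0), ([0.2, 0.8], [-0.1], 0.0)]))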

Main authors: 

Rasool Fakoor

Twitter | LinkedIn | GitHub | Website

Alexander J. Smola

Twitter | LinkedIn | Website


Summary

The depth and breadth of the ICLR publications are quite inspiring. Here, I have presented just the tip of the iceberg, focusing on the “reinforcement learning” topic. However, as you can read in this analysis, there were four main areas discussed at the conference:

  1. Deep learning (covered in our previous post)
  2. Reinforcement learning (covered in this post)
  3. Generative models (here)
  4. Natural Language Processing/Understanding (here)

To create a more complete overview of the top papers at ICLR, we are building a series of posts, each focused on one of the topics mentioned above. You may want to check them out.

Feel free to share with us other interesting papers on reinforcement learning and we will gladly add them to the list.

Enjoy reading!
