The Best Deep Learning Papers from the ICLR 2020 Conference

Posted May 5, 2020

Last week I had the pleasure of participating in the International Conference on Learning Representations (ICLR), an event dedicated to research on all aspects of deep learning. The conference was originally supposed to take place in Addis Ababa, Ethiopia, but due to the novel coronavirus pandemic it went virtual. I’m sure it was a challenge for the organisers to move the event online, but I think the result was more than satisfactory, as you can read here!

With over 1300 speakers and 5600 attendees, the virtual format proved more accessible to the public, and the conference still remained interactive and engaging. From the many interesting presentations, I chose 16 that are influential and thought-provoking. Here are the best deep learning papers from ICLR.

Best Deep Learning Papers

1. On Robustness of Neural Ordinary Differential Equations

An in-depth study of the robustness of Neural Ordinary Differential Equations, or NeuralODEs for short. Use them as a building block for more robust networks.


The architecture of an ODENet. The neural ODE block serves as a dimension-preserving nonlinear mapping.
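
To make the "dimension-preserving nonlinear mapping" idea concrete, here is a minimal sketch of an ODE block in plain Python. It is not the paper's implementation: the dynamics function `f`, the weight matrix `w`, and the fixed-step Euler solver are all assumptions I picked for illustration (real neural ODEs typically use learned dynamics and more careful solvers).

```python
import math


def f(h, t, w):
    """Hypothetical dynamics dh/dt = tanh(W h); output has the same size as h."""
    return [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h))) for row in w]


def ode_block(h0, w, t0=0.0, t1=1.0, steps=100):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler updates."""
    h = list(h0)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        dh = f(h, t, w)
        h = [h_i + dt * dh_i for h_i, dh_i in zip(h, dh)]
        t += dt
    return h


# Toy 2-D state with an illustrative (not learned) weight matrix.
w = [[0.0, -1.0], [1.0, 0.0]]
h1 = ode_block([1.0, 0.0], w)
print(h1)  # a 2-D vector in, a 2-D vector out
```

The block maps a state to a state of the same dimension, which is why it can be dropped between other layers of a network.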

First author: Hanshu YAN

LinkedIn | Website 

2. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

Gradient clipping provably accelerates gradient descent for non-smooth non-convex functions.

(TL;DR, from

Paper | Code

Gradient norm vs local gradient Lipschitz constant on a log-scale along the training trajectory for AWD-LSTM (Merity et al., 2018) on PTB dataset. The colorbar indicates the number of iterations during training. 
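
A toy illustration of the idea, not the paper's experiment: the sketch below runs gradient descent on f(x) = x^4, whose curvature grows with the gradient, with and without norm clipping. The function, step size, and clipping threshold are my own assumptions chosen to make the effect visible.

```python
def gradient_descent(grad, x0, lr, steps, clip=None):
    """Gradient descent on a scalar. If `clip` is set, any gradient whose
    magnitude exceeds it is rescaled to have magnitude `clip`."""
    x = x0
    for _ in range(steps):
        g = grad(x)
        if clip is not None and abs(g) > clip:
            g = g * clip / abs(g)
        x -= lr * g
    return x


def grad_quartic(x):
    """Gradient of f(x) = x**4."""
    return 4.0 * x * x * x


# Same starting point and learning rate; only the clipping differs.
x_clipped = gradient_descent(grad_quartic, x0=3.0, lr=0.1, steps=200, clip=1.0)
x_plain = gradient_descent(grad_quartic, x0=3.0, lr=0.1, steps=3)

print("with clipping:", x_clipped)              # moves steadily towards the minimum at 0
print("without clipping, 3 steps:", x_plain)    # overshoots and blows up
```

Without clipping, the learning rate would have to be much smaller to stay stable far from the minimum; clipping lets the same learning rate work in both the steep and the flat regions.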

First author: Jingzhao Zhang

LinkedIn | Website

3. Target-Embedding Autoencoders for Supervised Representation Learning

A new, general framework of target-embedding autoencoders, or TEA, for supervised prediction. The authors give both theoretical and empirical considerations.


(a) Feature-embedding and (b) Target-embedding autoencoders. Solid lines correspond to the (primary) prediction task; dashed lines to the (auxiliary) reconstruction task. Shared components are involved in both.

First author: Daniel Jarrett


4. Understanding and Robustifying Differentiable Architecture Search

We study the failure modes of DARTS (Differentiable Architecture Search) by looking at the eigenvalues of the Hessian of validation loss w.r.t. the architecture and propose robustifications based on our analysis.

(TL;DR, from

Paper | Code

The poor cells standard DARTS finds on spaces S1-S4. For all spaces, DARTS chooses mostly parameter-less operations (skip connection) or even the harmful Noise operation. Shown are the normal cells on CIFAR-10.

First author: Arber Zela

LinkedIn | GitHub 

5. Comparing Rewinding and Fine-tuning in Neural Network Pruning

Instead of fine-tuning after pruning, rewind weights or learning rate schedule to their values earlier in training and retrain from there to achieve higher accuracy when pruning neural networks.

(TL;DR, from

Paper | Code

The best achievable accuracy across retraining times by one-shot pruning.
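
Here is a rough sketch of how the starting point for retraining differs between fine-tuning and weight rewinding after one-shot magnitude pruning. The helper names (`magnitude_mask`, `apply_mask`) and the toy weight vectors are hypothetical, and the actual retraining loop is left out.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the `sparsity` fraction of weights
    with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out pruned weights."""
    return [w * m for w, m in zip(weights, mask)]


# Toy checkpoints: weights early in training and at the end of training.
early_weights = [0.5, -0.2, 0.1, 0.3]
final_weights = [0.9, -0.05, 0.4, 0.01]

# The mask is always chosen from the final, trained weights.
mask = magnitude_mask(final_weights, sparsity=0.5)

# Fine-tuning: keep the final values of the surviving weights.
fine_tune_start = apply_mask(final_weights, mask)

# Weight rewinding: reset the surviving weights to their earlier values,
# then retrain from that point with the original schedule.
rewind_start = apply_mask(early_weights, mask)

print(mask, fine_tune_start, rewind_start)
```

Learning rate rewinding, the other variant mentioned above, would keep `fine_tune_start` as the weights but replay the earlier part of the learning rate schedule.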

First author: Alex Renda

Twitter | LinkedIn | GitHub | Website 

6. Neural Arithmetic Units

Neural nets, while capable of approximating complex functions, are rather poor at exact arithmetic operations, which has been a longstanding challenge for deep learning researchers. This paper presents two novel units: the Neural Addition Unit (NAU), capable of exact addition and subtraction, and the Neural Multiplication Unit (NMU), capable of multiplying subsets of a vector. Notably, the first author is an independent researcher 🙂

Paper | Code

Visualization of the NMU, where the weights (W_i,j) control gating between 1 (identity) and x_i; each intermediate result is then multiplied explicitly to form z_j.
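
The gating described in the caption can be written as z_j = prod_i (W_i,j * x_i + 1 - W_i,j). Below is a minimal forward-pass sketch with hand-set weights; training, the regularization that pushes weights towards discrete values, and the function names `nmu` and `nau` are outside what the post describes and are only my illustration.

```python
def nmu(x, W):
    """Multiplication unit: z_j = prod_i (W[i][j] * x[i] + 1 - W[i][j]).
    W[i][j] = 1 includes x[i] in product j; W[i][j] = 0 replaces it with 1."""
    out = []
    for j in range(len(W[0])):
        z = 1.0
        for i, x_i in enumerate(x):
            z *= W[i][j] * x_i + 1.0 - W[i][j]
        out.append(z)
    return out


def nau(x, W):
    """Addition unit: a linear layer whose weights are meant to be -1, 0 or 1."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


x = [2.0, 3.0, 5.0]

# Column 0 multiplies x[0] * x[1]; column 1 multiplies x[1] * x[2].
products = nmu(x, [[1, 0], [1, 1], [0, 1]])

# Single output computing x[0] - x[1].
difference = nau(x, [[1], [-1], [0]])

print(products, difference)
```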

First author: Andreas Madsen

Twitter | LinkedIn | GitHub

7. The Break-Even Point on Optimization Trajectories of Deep Neural Networks

In the early phase of training of deep neural networks there exists a “break-even point” which determines properties of the entire optimization trajectory.

(TL;DR, from


Visualization of the early part of the training trajectories on CIFAR-10 (before reaching 65% training accuracy) of a simple CNN model optimized using SGD with learning rates η = 0.01 (red) and η = 0.001 (blue). Each model on the training trajectory, shown as a point, is represented by its test predictions embedded into a two-dimensional space using UMAP. The background color indicates the spectral norm of the covariance of gradients K (λ_1^K, left) and the training accuracy (right). For lower η, after reaching what we call the break-even point, the trajectory is steered towards a region characterized by larger λ_1^K (left) for the same training accuracy (right).

First author: Stanisław Jastrzębski

Twitter | LinkedIn | GitHub

8. Hoppity: Learning Graph Transformations To Detect And Fix Bugs In Programs

A learning-based approach for detecting and fixing bugs in JavaScript.

(TL;DR, from


Example programs that illustrate limitations of existing approaches, including both rule-based static analyzers and neural-based bug predictors.

First author: Elizabeth Dinella

Twitter | LinkedIn | GitHub

9. Selection via Proxy: Efficient Data Selection for Deep Learning

We can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection.

(TL;DR, from

Paper | Code

SVP applied to active learning (left) and core-set selection (right). In active learning, we followed the same iterative procedure of training and selecting points to label as traditional approaches but replaced the target model with a cheaper-to-compute proxy model. For core-set selection, we learned a feature representation over the data using a proxy model and used it to select points to train a larger, more accurate model. In both cases, we found the proxy and target model have high rank-order correlation, leading to similar selections and downstream results.
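
As a sketch of the active learning half of this idea: a cheap proxy model scores the unlabeled pool, and the most uncertain points are sent for labeling. The entropy criterion, the function names, and the hard-coded proxy probabilities below are assumptions for illustration; the paper's own selection methods and models are not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(proxy_probs, budget):
    """Return indices of the `budget` points the proxy is least certain about."""
    ranked = sorted(
        range(len(proxy_probs)),
        key=lambda i: entropy(proxy_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical class probabilities predicted by a small proxy model
# for four unlabeled points.
proxy_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.50, 0.50],
]

selected = select_for_labeling(proxy_probs, budget=2)
print(selected)  # indices of the points to label next
```

The selected points would then be labeled and used to train the larger target model, which is only trained once the selection is done.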

First author: Cody Coleman

Twitter | LinkedIn | GitHub | Website

10. And the Bit Goes Down: Revisiting the Quantization of Neural Networks

Using a structured quantization technique aiming at better in-domain reconstruction to compress convolutional neural networks.

(TL;DR, from

Paper | Code

Illustration of our method. We approximate a binary classifier ϕ that labels images as dogs or cats by quantizing its weights. Standard method: quantizing ϕ with the standard objective function (1) promotes a classifier ϕ̂_standard that tries to approximate ϕ over the entire input space and can thus perform badly for in-domain inputs. Our method: quantizing ϕ with our objective function (2) promotes a classifier ϕ̂_activations that performs well for in-domain inputs. Images lying in the hatched area of the input space are correctly classified by ϕ̂_activations but incorrectly by ϕ̂_standard.

First author: Pierre Stock

Twitter | LinkedIn | GitHub

11. A Signal Propagation Perspective for Pruning Neural Networks at Initialization

We formally characterize the initialization conditions for effective pruning at initialization and analyze the signal propagation properties of the resulting pruned networks which leads to a method to enhance their trainability and pruning results.

(TL;DR, from


(left) Layerwise sparsity patterns c ∈ {0, 1}^(100×100) obtained as a result of pruning for the sparsity levels κ̄ = {10, ..., 90}%. Here, black(0)/white(1) pixels refer to pruned/retained parameters; (right) connection sensitivities (CS) measured for the parameters in each layer. All networks are initialized with γ = 1.0. Unlike the linear case, the sparsity pattern for the tanh network is nonuniform over different layers. When pruning for a high sparsity level (e.g., κ̄ = 90%), this becomes critical and leads to poor learning capability as there are only a few parameters left in later layers. This is explained by the connection sensitivity plot which shows that for the nonlinear network parameters in later layers have saturating, lower connection sensitivities than those in earlier layers.

First author: Namhoon Lee

Twitter | LinkedIn | GitHub | Website

12. Deep Semi-Supervised Anomaly Detection

We introduce Deep SAD, a deep method for general semi-supervised anomaly detection that especially takes advantage of labeled anomalies.

(TL;DR, from

Paper | Code

The need for semi-supervised anomaly detection: The training data (shown in (a)) consists of (mostly normal) unlabeled data (gray) as well as a few labeled normal samples (blue) and labeled anomalies (orange). Figures (b)–(f) show the decision boundaries of the various learning paradigms at testing time along with novel anomalies that occur (bottom left in each plot). Our semi-supervised AD approach takes advantage of all training data: unlabeled samples, labeled normal samples, as well as labeled anomalies. This strikes a balance between one-class learning and classification.

First author: Lukas Ruff

Twitter | LinkedIn | GitHub

13. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells

We propose a representation learning model called Space2vec to encode the absolute positions and spatial relationships of places.

(TL;DR, from

Paper | Code

The challenge of jointly modeling distributions with very different characteristics. (a)(b) The POI locations (red dots) in Las Vegas and Space2Vec predicted conditional likelihood of Women’s Clothing (with a clustered distribution) and Education (with an even distribution). The dark area in (b) indicates that the downtown area has more POIs of other types than education. (c) Ripley’s K curves of POI types for which Space2Vec has the largest and smallest improvement over wrap (Mac Aodha et al., 2019). Each curve represents the number of POIs of a certain type inside a certain radius centered at every POI of that type; (d) Ripley’s K curves renormalized by POI densities and shown in log-scale. To efficiently achieve multi-scale representation, Space2Vec concatenates the grid cell encoding of 64 scales (with wavelengths ranging from 50 meters to 40k meters) as the first layer of a deep model, and trains with POI data in an unsupervised fashion.

First author: Gengchen Mai

Twitter | LinkedIn | GitHub | Website

14. Federated Learning with Matched Averaging

Communication efficient federated learning with layer-wise matching.

(TL;DR, from

Paper | Code

Comparison among various federated learning methods with limited number of communications on LeNet trained on MNIST; VGG-9 trained on CIFAR-10 dataset; LSTM trained on Shakespeare dataset over: (a) homogeneous data partition (b) heterogeneous data partition. 

First author: Hongyi Wang

Twitter | LinkedIn | GitHub 

15. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Reinforcement Learning and Adaptive Sampling for Optimized Compilation of Deep Neural Networks.

(TL;DR, from


Overview of our model compilation workflow, and highlighted is the scope of this work.

First author: Byung Hoon Ahn

Twitter | LinkedIn | GitHub | Website

16. Network Deconvolution

We propose a method called network deconvolution that resembles the animal vision system to train convolutional networks better.

(TL;DR, from

Paper | Code

Performing convolution on this real world image using a correlative filter, such as a Gaussian kernel, adds correlations to the resulting image, which makes object recognition more difficult. The process of removing this blur is called deconvolution. What if, however, what we saw as the real world image was itself the result of some unknown correlative filter, which has made recognition more difficult? Our proposed network deconvolution operation can decorrelate underlying image features which allows neural networks to perform better.
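
To give a feel for what "decorrelating features" means, here is a small sketch that whitens 2-D feature vectors by multiplying them with the inverse square root of their covariance matrix. This is a generic ZCA-style whitening on toy data, which is one standard way to remove correlations; the paper's operation works on image patches inside convolution layers and is not reproduced here. All function names, the `eps` value, and the data are assumptions.

```python
import math


def covariance_2d(points):
    """Return the mean and the 2x2 covariance matrix of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), [[sxx, sxy], [sxy, syy]]


def inv_sqrt_2x2(c, eps=1e-5):
    """Inverse square root of a symmetric positive definite 2x2 matrix.
    Uses the closed form sqrt(C) = (C + s*I) / t with s = sqrt(det C)
    and t = sqrt(trace C + 2*s), then inverts the result."""
    a, b, d = c[0][0] + eps, c[0][1], c[1][1] + eps
    s = math.sqrt(a * d - b * b)
    t = math.sqrt(a + d + 2.0 * s)
    r00, r01, r11 = (a + s) / t, b / t, (d + s) / t
    det_r = r00 * r11 - r01 * r01
    return [[r11 / det_r, -r01 / det_r], [-r01 / det_r, r00 / det_r]]


def decorrelate(points):
    """Center the points and multiply them by the inverse square root
    of their covariance, so the result has (almost) identity covariance."""
    (mx, my), c = covariance_2d(points)
    w = inv_sqrt_2x2(c)
    return [
        (w[0][0] * (x - mx) + w[0][1] * (y - my),
         w[1][0] * (x - mx) + w[1][1] * (y - my))
        for x, y in points
    ]


# Toy correlated features.
features = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
whitened = decorrelate(features)
_, cov_after = covariance_2d(whitened)
print(cov_after)  # close to the identity matrix
```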

First author: Chengxi Ye

Twitter | GitHub


The depth and breadth of the ICLR publications are quite inspiring. Here, I presented just the tip of the iceberg, focusing on the “deep learning” topic. However, this analysis suggests that there were a few popular areas, specifically:

  1. Deep learning (covered in this post)
  2. Reinforcement learning (here)
  3. Generative models (here)
  4. Natural Language Processing/Understanding (here)

To create a more complete overview of the top papers at ICLR, we are building a series of posts, each focused on one of the topics mentioned above. You may want to check them out.

Happy reading!

AI Research Advocate

