Last week I had the pleasure of participating in the ECML-PKDD 2020 conference. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases is one of the most recognized academic conferences on ML in Europe.
It was a fully online event, run around the clock – a nice idea that made it accessible across all time zones. The conference schedule, neatly divided into tracks on various topics, made it simple to dive into my favourite areas: reinforcement learning, adversarial learning and meta-topics.
ECML-PKDD brings a large number of new ideas and inspiring developments in the ML field, so I wanted to pick the top papers and share them here.
In this post, I focus on research papers, grouped into the following categories: reinforcement learning, clustering, architecture of neural networks, transfer and multi-task learning, federated learning and clustering, network modeling, graph neural networks, NLP, time series and recurrent neural networks, dimensionality reduction and auto-encoders, large-scale optimization and differential privacy, adversarial learning, theory for deep learning, computer vision / image processing, and optimization for deep learning.
Enjoy!
Reinforcement learning
1. EgoMap: Projective mapping and structured egocentric memory for Deep RL
Paper abstract: Tasks involving localization, memorization and planning in partially observable 3D environments are an ongoing challenge in Deep Reinforcement Learning. We present EgoMap, a spatially structured neural memory architecture. EgoMap augments a deep reinforcement learning agent’s performance in 3D environments on challenging tasks with multi-step objectives. (…)
2. Option Encoder: A Framework for Discovering a Policy Basis in Reinforcement Learning
Paper abstract: Option discovery and skill acquisition frameworks are integral to the functioning of a hierarchically organized Reinforcement learning agent. However, such techniques often yield a large number of options or skills, which can be represented succinctly by filtering out any redundant information. Such a reduction can decrease the required computation while also improving the performance on a target task. To compress an array of option policies, we attempt to find a policy basis that accurately captures the set of all options. In this work, we propose Option Encoder, an auto-encoder based framework with intelligently constrained weights, that helps discover a collection of basis policies. (…)
Main authors:

Rahul Ramesh
3. ELSIM: End-to-end learning of reusable skills through intrinsic motivation
Paper abstract: Taking inspiration from developmental learning, we present a novel reinforcement learning architecture which hierarchically learns and represents self-generated skills in an end-to-end way. With this architecture, an agent focuses only on task-rewarded skills while keeping the learning process of skills bottom-up. This bottom-up approach makes it possible to learn skills that (1) are transferable across tasks and (2) improve exploration when rewards are sparse. To do so, we combine a previously defined mutual information objective with a novel curriculum learning algorithm, creating an unlimited and explorable tree of skills. (…)
First author:
Arthur Aubret
4. Graph-based Motion Planning Networks
Paper abstract: Differentiable planning network architecture has been shown to be powerful in solving transfer planning tasks while it possesses a simple end-to-end training feature. (…) However, existing frameworks can only learn and plan effectively on domains with a lattice structure, i.e., regular graphs embedded in a particular Euclidean space. In this paper, we propose a general planning network called Graph-based Motion Planning Networks (GrMPN). GrMPN will be able to i) learn and plan on general irregular graphs, hence ii) render existing planning network architectures special cases. (…)
First author:
Tai Hoang
Clustering
1. Utilizing Structure-rich Features to improve Clustering
Paper abstract: For successful clustering, an algorithm needs to find the boundaries between clusters. While this is comparatively easy if the clusters are compact and non-overlapping and thus the boundaries clearly defined, features where the clusters blend into each other hinder clustering methods from correctly estimating these boundaries. Therefore, we aim to extract features showing clear cluster boundaries and thus enhance the cluster structure in the data. Our novel technique creates a condensed version of the data set containing the structure important for clustering, but without the noise information. We demonstrate that this transformation of the data set is much easier to cluster, not only for k-means but also for various other algorithms. Furthermore, we introduce a deterministic initialisation strategy for k-means based on these structure-rich features. (…)

2. Online Binary Incomplete Multi-view Clustering
Paper abstract: Multi-view clustering has attracted considerable attention in the past decades, due to its good performance on the data with multiple modalities or from diverse sources. In real-world applications, multi-view data often suffer from incompleteness of instances. Clustering on such multi-view data is called incomplete multi-view clustering (IMC). Most of the existing IMC solutions are offline and have high computational and memory costs especially for large-scale datasets. To tackle these challenges, in this paper, we propose an Online Binary Incomplete Multi-view Clustering (OBIMC) framework. OBIMC robustly learns the common compact binary codes for incomplete multi-view features. (…)
First author:
Longqi Yang
3. Simple, Scalable, and Stable Variational Deep Clustering
Paper abstract: Deep clustering (DC) has become the state-of-the-art for unsupervised clustering. In principle, DC represents a variety of unsupervised methods that jointly learn the underlying clusters and the latent representation directly from unstructured datasets. However, DC methods are generally poorly applied due to high operational costs, low scalability, and unstable results. In this paper, we first evaluate several popular DC variants in the context of industrial applicability using eight empirical criteria. We then choose to focus on variational deep clustering (VDC) methods, since they mostly meet those criteria except for simplicity, scalability, and stability. (…)
4. Gauss Shift: Density Attractor Clustering Faster than Mean Shift
Paper abstract: Mean shift is a popular and powerful clustering method. While techniques exist that improve its absolute runtime, no method has been able to effectively improve its quadratic time complexity with regard to dataset size. To enable development of an alternative, faster method that leads to the same results, we first contribute the formal cluster definition, which mean shift implicitly follows. Based on this definition we derive and contribute Gauss shift – a method that has linear time complexity. We quantify the characteristics of Gauss shift using synthetic datasets with known topologies. We further qualify Gauss shift using real-life data from active neuroscience research, which is the most comprehensive description of any subcellular organelle to date.
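For context on the quadratic cost the paper targets: plain mean shift recomputes kernel weights against every data point for every mode at every iteration. Below is a minimal numpy sketch of that baseline (this is ordinary mean shift, not the paper's Gauss shift; the bandwidth and data are illustrative):

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iters=50, tol=1e-5):
    """Plain mean shift with a Gaussian kernel.

    Every iteration compares each of the n current modes against all n
    data points, so the cost is O(n^2) per iteration -- the quadratic
    behaviour that a linear-time alternative would avoid.
    """
    modes = X.copy()
    for _ in range(n_iters):
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))                   # Gaussian kernel weights
        new_modes = (w @ X) / w.sum(axis=1, keepdims=True)         # kernel-weighted means
        if np.abs(new_modes - modes).max() < tol:
            return new_modes
        modes = new_modes
    return modes  # points converging to the same mode share a density attractor

# toy usage: two well-separated blobs
X = np.random.RandomState(0).randn(200, 2)
X[100:] += 5.0
modes = mean_shift(X, bandwidth=1.0)
```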
Architecture of neural networks
1. Finding the Optimal Network Depth in Classification Tasks
Paper abstract: We develop a fast end-to-end method for training lightweight neural networks using multiple classifier heads. By allowing the model to determine the importance of each head and rewarding the choice of a single shallow classifier, we are able to detect and remove unneeded components of the network. This operation, which can be seen as finding the optimal depth of the model, significantly reduces the number of parameters and accelerates inference across different hardware processing units, which is not the case for many standard pruning methods. (…)
Main authors:
Bartosz Wójcik
Maciej Wołczyk
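To make the setup above more concrete, here is a toy PyTorch sketch: one classifier head per block, a learnable importance weight per head, and a penalty that rewards putting mass on shallow heads. The loss form, the penalty and all sizes are my illustrative assumptions, not the authors' exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadDepthNet(nn.Module):
    """MLP with a classifier head after every block and learnable head importances."""
    def __init__(self, in_dim=32, hidden=64, n_classes=10, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else hidden, hidden) for i in range(n_blocks)]
        )
        self.heads = nn.ModuleList([nn.Linear(hidden, n_classes) for _ in range(n_blocks)])
        self.head_logits = nn.Parameter(torch.zeros(n_blocks))  # importance of each head

    def forward(self, x):
        outputs = []
        for block, head in zip(self.blocks, self.heads):
            x = F.relu(block(x))
            outputs.append(head(x))
        return outputs, F.softmax(self.head_logits, dim=0)

def depth_loss(outputs, weights, targets, depth_penalty=0.01):
    """Weighted per-head classification loss plus a reward for choosing shallow heads."""
    ce = sum(w * F.cross_entropy(o, targets) for w, o in zip(weights, outputs))
    depths = torch.arange(1, len(outputs) + 1, dtype=weights.dtype)
    return ce + depth_penalty * (weights * depths).sum()  # deeper heads cost more

# toy usage: after training, blocks beyond the dominant head could be pruned for inference
net = MultiHeadDepthNet()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
outputs, weights = net(x)
loss = depth_loss(outputs, weights, y)
```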
2. XferNAS: Transfer Neural Architecture Search
Paper abstract: The term Neural Architecture Search (NAS) refers to the automatic optimization of network architectures for a new, previously unknown task. Since testing an architecture is computationally very expensive, many optimizers need days or even weeks to find suitable architectures. However, this search time can be significantly reduced if knowledge from previous searches on different tasks is reused. In this work, we propose a generally applicable framework that introduces only minor changes to existing optimizers to leverage this feature. (…) In addition, we observe new records of 1.99 and 14.06 for NAS optimizers on the CIFAR benchmarks, respectively. In a separate study, we analyze the impact of the amount of source and target data. (…)
3. Topological Insights into Sparse Neural Networks
Paper abstract: Sparse neural networks are effective approaches to reduce the resource requirements for the deployment of deep neural networks. Recently, the concept of adaptive sparse connectivity has emerged to allow training sparse neural networks from scratch by optimizing the sparse structure during training. (…) In this work, we introduce an approach to understand and compare sparse neural network topologies from the perspective of graph theory. We first propose Neural Network Sparse Topology Distance (NNSTD) to measure the distance between different sparse neural networks. Further, we demonstrate that sparse neural networks can outperform over-parameterized models in terms of performance, even without any further structure optimization. (…)
Main authors:

Shiwei Liu
Transfer and multi-task learning
1. Graph Diffusion Wasserstein Distances
Paper abstract: Optimal Transport (OT) for structured data has received much attention in the machine learning community, especially for addressing graph classification or graph transfer learning tasks. In this paper, we present the Diffusion Wasserstein (DW) distance, as a generalization of the standard Wasserstein distance to undirected and connected graphs where nodes are described by feature vectors. DW is based on the Laplacian exponential kernel and benefits from the heat diffusion to catch both structural and feature information from the graphs. (…)
First author:
Amélie Barbe
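To get a feel for the construction, here is a rough sketch using the POT library (`pip install pot`): node features are first smoothed with the Laplacian exponential (heat) kernel, and the two diffused feature clouds are then compared with a standard Wasserstein distance. The diffusion time and the uniform node weights are illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np
from scipy.linalg import expm
import ot  # POT: Python Optimal Transport

def diffused_features(A, X, tau=1.0):
    """Smooth node features X with the heat kernel exp(-tau * L) of the graph A."""
    L = np.diag(A.sum(axis=1)) - A     # combinatorial graph Laplacian
    return expm(-tau * L) @ X          # heat diffusion of the features

def dw_distance(A1, X1, A2, X2, tau=1.0):
    """Wasserstein distance between heat-diffused node features of two graphs."""
    F1, F2 = diffused_features(A1, X1, tau), diffused_features(A2, X2, tau)
    M = ot.dist(F1, F2)                              # pairwise (squared Euclidean) costs
    a = np.full(len(F1), 1.0 / len(F1))              # uniform weights over nodes
    b = np.full(len(F2), 1.0 / len(F2))
    return ot.emd2(a, b, M)                          # exact optimal transport cost

# toy usage: a triangle graph vs. a path graph, each with 2-d node features
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X1, X2 = np.random.rand(3, 2), np.random.rand(3, 2)
print(dw_distance(A1, X1, A2, X2))
```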
2. Towards Interpretable Multi Task Learning using bi-level programming
Paper abstract: Global Interpretable Multi Task Learning can be expressed as learning a sparse graph of the task relationship based on the prediction performance of the learned model. We propose a bilevel formulation of the regression multi-task problem that learns a sparse graph. We show that this sparse graph improves the interpretability of the learned models.

3. Diversity-Based Generalization for Unsupervised Text Classification under Domain Shift
Paper abstract: Domain adaptation approaches seek to learn from a source domain and generalize it to an unseen target domain. (…) In this paper, we propose a novel method for domain adaptation of single-task text classification problems based on a simple but effective idea of diversity-based generalization that does not require unlabeled target data. Diversity plays the role of promoting the model to better generalize and be indiscriminate towards domain shift by forcing the model not to rely on the same features for prediction. We apply this concept on the most explainable component of neural networks, the attention layer. (…)
4. Deep Learning, Grammar Transfer, and Transportation Theory
Paper abstract: Despite its widespread adoption and success, deep learning-based artificial intelligence techniques have limitations in providing an understandable decision-making process. This makes the “intelligence” part questionable since we expect a real artificial intelligence to not only complete a given task but also perform in a way that is understandable from a human perspective. For this to happen, we need to build a connection between artificial intelligence and human intelligence. Here, we use grammar transfer to demonstrate a paradigm for connecting these two types of intelligence. (…)
First author:
Kaixuan Zhang
5. Unsupervised Domain Adaptation with Joint Domain-Adversarial Reconstruction Networks
Paper abstract: Unsupervised Domain Adaptation (UDA) attempts to transfer knowledge from a labeled source domain to an unlabeled target domain. (…) we propose in this paper a novel model called Joint Domain-Adversarial Reconstruction Network (JDARN), which integrates domain-adversarial learning with data reconstruction to learn both domain-invariant and domain-specific representations. Meanwhile, we propose to employ two novel discriminators called joint domain-class discriminators to achieve the joint alignment and adopt a novel joint adversarial loss to train them. (…)
First author:
Qian Chen
Federated learning and clustering
1. An algorithmic framework for decentralised matrix factorisation
Paper abstract: We propose a framework for fully decentralised machine learning and apply it to latent factor models for top-N recommendation. The training data in a decentralised learning setting is distributed across multiple agents, who jointly optimise a common global objective function (the loss function). Here, in contrast to the client-server architecture of federated learning, the agents communicate directly, maintaining and updating their own model parameters, without central aggregation and without sharing their own data. (…)
Main authors:
Erika Duriakova
Weipeng Huang
2. Federated Multi-view Matrix Factorization for Personalized Recommendations
Paper abstract: We introduce the federated multi-view matrix factorization method that extends the federated learning framework to matrix factorization with multiple data sources. Our method is able to learn the multi-view model without transferring the user’s personal data to a central server. As far as we are aware this is the first federated model to provide recommendations using multi-view matrix factorization. The model is rigorously evaluated on three datasets in production settings. (…)
3. FedMAX: Mitigating Activation Divergence for Accurate and Communication-Efficient Federated Learning
Paper abstract: In this paper, we identify a new phenomenon called activation-divergence that happens in Federated Learning due to data heterogeneity. Specifically, we argue that activation vectors can diverge when using federated learning, even if a subset of users share a few common classes with data residing on different devices. To address this issue, we introduce a prior based on the Principle of Maximum Entropy; this prior assumes minimal information about the per-device activation vectors and aims at making activation vectors for same classes similar across multiple devices. (…)
First author:
Wei Chen
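The maximum-entropy prior can be read as a regularizer that pushes each device's activation vectors towards a uniform distribution, i.e. towards carrying minimal extra information. A hedged PyTorch sketch of that idea follows; the choice of layer, the KL form and the weight `beta` are my assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def fedmax_style_loss(logits, activations, targets, beta=1.0):
    """Classification loss plus a max-entropy penalty on activation vectors.

    `activations` are the pre-classifier activations of a local (per-device)
    model. The KL term pulls softmax(activations) towards the uniform
    distribution, one way to encode a maximum-entropy prior that keeps
    activation vectors for the same classes similar across devices.
    """
    ce = F.cross_entropy(logits, targets)
    log_p = F.log_softmax(activations, dim=1)
    uniform = torch.full_like(log_p, 1.0 / activations.size(1))
    kl = F.kl_div(log_p, uniform, reduction="batchmean")
    return ce + beta * kl

# toy usage
logits = torch.randn(8, 10)            # class scores from the local model
acts = torch.randn(8, 128)             # per-sample activation vectors
targets = torch.randint(0, 10, (8,))
loss = fedmax_style_loss(logits, acts, targets)
```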
4. Model-based Clustering with HDBSCAN*
Paper abstract: We propose an efficient model-based clustering approach for creating Gaussian Mixture Models from finite datasets. Models are extracted from HDBSCAN* hierarchies using the Classification Likelihood and the Expectation Maximization algorithm. Prior knowledge of the number of components of the model, corresponding to the number of clusters, is not necessary and can be determined dynamically. Due to relatively small hierarchies created by HDBSCAN* compared to previous approaches, this can be done efficiently. (…)
First author:
Michael Strobl
Network modeling
1. Progressive Supervision for Node Classification
Paper abstract: Graph Convolution Networks (GCNs) are a powerful approach for the task of node classification, in which GCNs are trained by minimizing the loss over the final-layer predictions. However, a limitation of this training scheme is that it enforces every node to be classified from the fixed and unified size of receptive fields, which may not be optimal. We propose ProSup (Progressive Supervision), which improves the effectiveness of GCNs by training them in a different way. ProSup supervises all layers progressively to guide their representations towards the characteristics we desire. (…)
First author:
Yiwei Wang
2. Modeling Dynamic Heterogeneous Network for Link Prediction using Hierarchical Attention with Temporal RNN
Paper abstract: Network embedding aims to learn low-dimensional representations of nodes while capturing structure information of networks. (…) In this paper, we propose a novel dynamic heterogeneous network embedding method, termed as DyHATR, which uses hierarchical attention to learn heterogeneous information and incorporates recurrent neural networks with temporal attention to capture evolutionary patterns. (…)
3. GIKT: A Graph-based Interaction Model for Knowledge Tracing
Paper abstract: With the rapid development in online education, knowledge tracing (KT) has become a fundamental problem which traces students’ knowledge status and predicts their performance on new questions. Questions are often numerous in online education systems, and are always associated with much fewer skills. (…) In this paper, we propose a Graph-based Interaction model for Knowledge Tracing (GIKT) to tackle the above problems. More specifically, GIKT utilizes graph convolution network (GCN) to substantially incorporate question-skill correlations via embedding propagation. (…)
First author:
Yang Yang
Graph neural networks
1. GRAM-SMOT: Top-N Personalized Bundle Recommendation via Graph Attention Mechanism and Sub-Modular Optimization
Paper abstract: Bundle recommendation, i.e., recommending a group of products in place of individual products to customers, is gaining attention day by day. It presents two interesting challenges — (1) how to efficiently recommend existing bundles to users, and (2) how to generate personalized novel bundles targeting specific users. (…) In this work, we propose GRAM-SMOT — a graph attention-based framework to address the above challenges. Further, we define a loss function based on a metric-learning approach to learn the embeddings of entities efficiently. (…)

First author:
Vijaikumar M
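The metric-learning loss is not spelled out in the excerpt above; a common choice for learning entity embeddings in recommendation is a triplet margin loss that pulls a user towards a bundle they interacted with and pushes them away from a sampled negative. A generic PyTorch sketch of that idea (embedding sizes and negative sampling are illustrative, and this is not claimed to be the paper's exact loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_users, n_bundles, dim = 100, 50, 32
user_emb = nn.Embedding(n_users, dim)
bundle_emb = nn.Embedding(n_bundles, dim)

def metric_loss(users, pos_bundles, neg_bundles, margin=1.0):
    """Triplet margin loss over (user, positive bundle, negative bundle) triples."""
    anchor = user_emb(users)
    positive = bundle_emb(pos_bundles)
    negative = bundle_emb(neg_bundles)
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# toy batch: each user has one observed (positive) and one randomly sampled (negative) bundle
users = torch.randint(0, n_users, (16,))
pos = torch.randint(0, n_bundles, (16,))
neg = torch.randint(0, n_bundles, (16,))
loss = metric_loss(users, pos, neg)
loss.backward()   # gradients flow into both embedding tables
```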
2. Temporal Heterogeneous Interaction Graph Embedding For Next-Item Recommendation
Paper abstract: In the scenario of next-item recommendation, previous methods attempt to model user preferences by capturing the evolution of sequential interactions. However, their sequential expression is often limited, without modeling complex dynamics that short-term demands can often be influenced by long-term habits. Moreover, few of them take into account the heterogeneous types of interaction between users and items. In this paper, we model such complex data as a Temporal Heterogeneous Interaction Graph (THIG) and learn both user and item embeddings on THIGs to address the next-item recommendation. The main challenges involve two aspects: the complex dynamics and rich heterogeneity of interactions. (…)

3. A Self-Attention Network based Node Embedding Model
Paper abstract: Although several signs of progress have been made recently, limited research has been conducted for an inductive setting where embeddings are required for newly unseen nodes — a setting encountered commonly in practical applications of deep learning for graph networks. (…) To this end, we propose SANNE — a novel unsupervised embedding model — whose central idea is to employ a self-attention mechanism followed by a feed-forward network, in order to iteratively aggregate vector representations of nodes in sampled random walks. (…)
4. Graph-Revised Convolutional Network
Paper abstract: Graph Convolutional Networks (GCNs) have received increasing attention in the machine learning community for effectively leveraging both the content features of nodes and the linkage patterns across graphs in various applications. (…) This paper proposes a novel framework called Graph-Revised Convolutional Network (GRCN), which avoids both extremes. Specifically, a GCN-based graph revision module is introduced for predicting missing edges and revising edge weights w.r.t. downstream tasks via joint optimization. (…)
5. Robust Training of Graph Convolutional Networks via Latent Perturbation
Paper abstract: Despite the recent success of graph convolutional networks (GCNs) in modeling graph structured data, its vulnerability to adversarial attacks has been revealed and attacks on both node feature and graph structure have been designed. (…) We propose addressing this issue by perturbing the latent representations in GCNs, which not only dispenses with generating adversarial networks, but also attains improved robustness and accuracy by respecting the latent manifold of the data. This new framework of latent adversarial training on graphs is applied to node classification, link prediction, and recommender systems. (…)
NLP
1. Early Detection of Fake News with Multi-Source Weak Social Supervision
Paper abstract: Social media has greatly enabled people to participate in online activities at an unprecedented rate. However, this unrestricted access also exacerbates the spread of misinformation and fake news online which might cause confusion and chaos unless being detected early for its mitigation. (…) In this work, we exploit multiple weak signals from different sources given by user and content engagements and their complementary utilities to detect fake news. We jointly leverage the limited amount of clean data along with weak signals from social engagements to train deep neural networks in a meta-learning framework to estimate the quality of different weak instances. (…)
2. Generating Financial Reports from Macro News via Multiple edits Neural Networks
Paper abstract: Automatically generating financial reports given a piece of breaking macro news is quite a challenging task. Essentially, this task is a text-to-text generation problem, but one that has to produce a long text, i.e., longer than 40 words, from a piece of short news. (…) To address this issue, we propose the novel multiple edits neural networks approach, which first learns the outline for given news and then generates financial reports from the learnt outline. Particularly, the input news is first embedded via a skip-gram model and is then fed into a Bi-LSTM component to train the contextual representation vector. (…)
First author:
Yunpeng Ren
3. Inductive Document Representation Learning for Short Text Clustering
Paper abstract: Short text clustering (STC) is an important task that can discover topics or groups in the fast-growing social networks, e.g., Tweets and Google News. (…) Inspired by the mechanism of vertex information propagation guided by the graph structure in GNNs, we propose an inductive document representation learning model, called IDRL, that can map the short text structures into a graph network and recursively aggregate the neighbor information of the words in the unseen documents. Then, we can reconstruct the representations of the previously unseen short texts with the limited numbers of word embeddings learned before. (…)
First author:
Junyang Chen
4. Hierarchical Interaction Networks with Rethinking Mechanism for Document-level Sentiment Analysis
Paper abstract: Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicated sentiment information. Recent works have been devoted to leveraging text summarization and have achieved promising results. However, these summarization-based methods did not take full advantage of the summary, ignoring the inherent interactions between the summary and document. (…) In this paper, we study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA. A Hierarchical Interaction Network (HIN) is proposed to explore bidirectional interactions between the summary and document at multiple granularities and learn subject-oriented document representations for sentiment classification. (…)
First author:
Lingwei Wei
5. Learning a Sequence of Sentiment Classification Tasks
Paper abstract: This paper studies sentiment classification (SC) in the lifelong learning setting (LL) in order to improve the SC accuracy. In the LL setting, the system learns a sequence of SC tasks incrementally in a neural network. This scenario is common in sentiment analysis applications because a sentiment analysis company needs to work on a large number of tasks for different clients. (…) This paper proposes a novel technique called KAN to achieve these objectives. KAN can markedly improve the SC accuracy of both the new task and the old tasks via forward and backward knowledge transfer. (…)
First author:
Zixuan Ke
Time series and recurrent neural networks
1. The Temporal Dictionary Ensemble (TDE) Classifier for Time Series Classification
Paper abstract: Using bag of words representations of time series is a popular approach to time series classification (TSC). These algorithms involve approximating and discretising windows over a series to form words, then forming a count of words over a given dictionary. Classifiers are constructed on the resulting histograms of word counts. A 2017 evaluation of a range of time series classifiers found the bag of symbolic-Fourier approximation symbols (BOSS) ensemble the best of the dictionary based classifiers. (…) We propose a further extension of these dictionary based classifiers that combines the best elements of the others combined with a novel approach to constructing ensemble members based on an adaptive Gaussian process model of the parameter space. (…)
2. Incremental training of a recurrent neural network exploiting a multi-scale dynamic memory
Paper abstract: The effectiveness of recurrent neural networks can be largely influenced by their ability to store into their dynamical memory information extracted from input sequences at different frequencies and timescales. (…) In this paper we propose a novel incrementally trained recurrent architecture targeting explicitly multi-scale learning. First, we show how to extend the architecture of a simple RNN by separating its hidden state into different modules, each subsampling the network hidden activations at different frequencies. Then, we discuss a training algorithm where new modules are iteratively added to the model to learn progressively longer dependencies. (…)
3. Flexible Recurrent Neural Networks
Paper abstract: We introduce two methods enabling recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. (…) The first approach makes minimal changes to the model. Therefore it avoids loading new parameters from slow memory. In the second approach, different models can replace one another within a sequence analysis. The latter works on more data sets. (…)
Main authors:
Anne Lambert
4. Z-Embedding: A Spectral Representation of Event Intervals for Efficient Clustering and Classification
Paper abstract: Sequences of event intervals occur in several application domains, while their inherent complexity hinders scalable solutions to tasks such as clustering and classification. In this paper, we propose a novel spectral embedding representation of event interval sequences that relies on bipartite graphs. More concretely, each event interval sequence is represented by a bipartite graph by following three main steps: (1) creating a hash table that can quickly convert a collection of event interval sequences into a bipartite graph representation, (2) creating and regularizing a bi-adjacency matrix corresponding to the bipartite graph, (3) defining a spectral embedding mapping on the bi-adjacency matrix. (…)

First author:
Zed Lee
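The pipeline in the abstract (bi-adjacency matrix, regularization, spectral embedding) maps onto a few lines of scipy. The sketch below assumes the hashing step has already produced a sequences-by-patterns count matrix; the normalization and the regularization constant are illustrative choices, not the paper's exact recipe:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def spectral_embedding(B, dim=3, reg=0.01):
    """Spectral embedding of the rows of a (sequences x patterns) bi-adjacency matrix.

    The matrix is degree-normalized after a small additive regularization, and
    the leading singular vectors are used as the embedding of the sequences.
    """
    B = B.astype(float) + reg                              # regularized bi-adjacency matrix
    r = 1.0 / np.sqrt(B.sum(axis=1, keepdims=True))        # row (sequence) normalization
    c = 1.0 / np.sqrt(B.sum(axis=0, keepdims=True))        # column (pattern) normalization
    U, s, _ = svds(csr_matrix(r * B * c), k=dim)           # truncated SVD
    return U * s                                           # sequence embeddings

# toy usage: 5 event-interval sequences described by 12 hashed patterns
rng = np.random.default_rng(0)
counts = (rng.random((5, 12)) > 0.6).astype(float)
Z = spectral_embedding(counts, dim=3)   # feed Z to k-means or a classifier
```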
Dimensionality reduction and auto-encoders
1. Simple and Effective Graph Autoencoders with One-Hop Linear Models
Paper abstract: Over the last few years, graph autoencoders (AE) and variational autoencoders (VAE) emerged as powerful node embedding methods, (…). Graph AE, VAE and most of their extensions rely on multi-layer graph convolutional networks (GCN) encoders to learn vector space representations of nodes. In this paper, we show that GCN encoders are actually unnecessarily complex for many applications. We propose to replace them by significantly simpler and more interpretable linear models w.r.t. the direct neighborhood (one-hop) adjacency matrix of the graph, involving fewer operations, fewer parameters and no activation function. (…)
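The proposed encoder is simple enough to write down directly: a single linear map of the normalized one-hop adjacency matrix, with the usual inner-product decoder. A minimal numpy sketch, assuming symmetric normalization and a random (untrained) weight matrix just to show the data flow:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def linear_graph_ae(A, X, dim=16, seed=0):
    """One-hop linear encoder Z = norm(A) X W and inner-product decoder sigmoid(Z Z^T)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], dim))    # would be learned in practice
    Z = normalize_adjacency(A) @ X @ W                    # linear: no GCN stack, no activation
    A_rec = 1.0 / (1.0 + np.exp(-Z @ Z.T))                # reconstructed edge probabilities
    return Z, A_rec

# toy usage: a 3-node path graph with 5-dimensional node features
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.rand(3, 5)
Z, A_rec = linear_graph_ae(A, X, dim=2)
```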
2. Sparse Separable Nonnegative Matrix Factorization
Paper abstract: We propose a new variant of nonnegative matrix factorization (NMF), combining separability and sparsity assumptions. Separability requires that the columns of the first NMF factor are equal to columns of the input matrix, while sparsity requires that the columns of the second NMF factor are sparse. We call this variant sparse separable NMF (SSNMF), which we prove to be NP-hard, as opposed to separable NMF which can be solved in polynomial time. (…)
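As a point of reference for the separability assumption, the classic successive projection algorithm (SPA) picks the columns of the input matrix that serve as the first NMF factor. The sketch below is that standard heuristic, not the SSNMF algorithm from the paper, and the toy data is constructed so that the pure columns are recoverable:

```python
import numpy as np

def successive_projection(M, r):
    """Select r columns of M that (approximately) span its column space (SPA)."""
    R = M.astype(float).copy()
    selected = []
    for _ in range(r):
        j = int(np.argmax((R ** 2).sum(axis=0)))   # column with the largest residual norm
        selected.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                     # project the residual away from that column
    return selected

# toy usage: M = W H where the first 3 columns of H form the identity,
# so columns 0, 1, 2 of M are the "pure" columns SPA should recover
rng = np.random.default_rng(0)
W = rng.random((8, 3))
H = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=7).T])
M = W @ H
print(successive_projection(M, 3))
```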
Large-scale optimization and differential privacy
1. Orthant Based Proximal Stochastic Gradient Method for l1-Regularized Optimization
Paper abstract: Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method – Orthant Based Proximal Stochastic Gradient Method (OBProx-SG) – to solve perhaps the most popular instance, i.e., the l1-regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. (…)
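The two steps in the abstract translate into compact update rules: a proximal stochastic gradient step (soft-thresholding) that predicts which coordinates remain non-zero, followed by an orthant step that zeroes any coordinate whose sign would flip. A schematic numpy sketch under those assumptions; the switching schedule and step sizes are illustrative, not the authors' exact algorithm:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sg_step(x, grad, lr, lam):
    """Step (i): proximal stochastic gradient step for the l1-regularized problem."""
    return soft_threshold(x - lr * grad, lr * lam)

def orthant_step(x, grad, lr, lam):
    """Step (ii): gradient step restricted to the orthant of the current iterate.

    Coordinates that would cross zero are projected onto the orthant face
    (set to zero), which is what aggressively promotes sparsity.
    """
    sign = np.sign(x)
    x_new = x - lr * (grad + lam * sign)     # smooth gradient plus l1 subgradient in the orthant
    x_new[np.sign(x_new) != sign] = 0.0      # orthant face projection
    return x_new

# toy usage on an l1-regularized least-squares problem: prox steps first, then orthant steps
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x, lam, lr = np.zeros(20), 0.1, 0.05
for it in range(200):
    grad = A.T @ (A @ x - b) / len(b)
    x = prox_sg_step(x, grad, lr, lam) if it < 100 else orthant_step(x, grad, lr, lam)
```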
2. Efficiency of Coordinate Descent Methods For Structured Nonconvex Optimization
Paper abstract: Novel coordinate descent (CD) methods are proposed for minimizing nonconvex functions consisting of three terms: (i) a continuously differentiable term, (ii) a simple convex term, and (iii) a concave and continuous term. First, by extending randomized CD to nonsmooth nonconvex settings, we develop a coordinate subgradient method that randomly updates block-coordinate variables by using block composite subgradient mapping. (…) Second, we develop a randomly permuted CD method with two alternating steps: linearizing the concave part and cycling through variables. (…) Third, we extend accelerated coordinate descent (ACD) to nonsmooth and nonconvex optimization to develop a novel randomized proximal DC algorithm whereby we solve the subproblem inexactly by ACD. (…)
First author:
Qi Deng
3. Escaping Saddle Points of Empirical Risk Privately and Scalably via DP-Trust Region Method
Paper abstract: It has been shown recently that many non-convex objective/loss functions in machine learning and deep learning are known to be strict saddle. This means that finding a second-order stationary point (i.e., approximate local minimum) and thus escaping saddle points are sufficient for such functions to obtain a classifier with good generalization performance. Existing algorithms for escaping saddle points, however, all fail to take into consideration a critical issue in their designs, that is, the protection of sensitive information in the training set. (…) In this paper, we investigate the problem of privately escaping saddle points and finding a second-order stationary point of the empirical risk of non-convex loss function. (…)
Adversarial learning
1. Adversarial Learned Molecular Graph Inference and Generation
Paper abstract: Recent methods for generating novel molecules use graph representations of molecules and employ various forms of graph convolutional neural networks for inference. However, training requires solving an expensive graph isomorphism problem, which previous approaches do not address or solve only approximately. In this work, we propose ALMGIG, a likelihood-free adversarial learning framework for inference and de novo molecule generation that avoids explicitly computing a reconstruction loss. Our approach extends generative adversarial networks by including an adversarial cycle-consistency loss to implicitly enforce the reconstruction property. (…)
2. A Generic and Model-Agnostic Exemplar Synthetization Framework for Explainable AI
Paper abstract: With the growing complexity of deep learning methods adopted in practical applications, there is an increasing and stringent need to explain and interpret the decisions of such methods. In this work, we focus on explainable AI and propose a novel generic and model-agnostic framework for synthesizing input exemplars that maximize a desired response from a machine learning model. To this end, we use a generative model, which acts as a prior for generating data, and traverse its latent space using a novel evolutionary strategy with momentum updates. (…)
3. Quality Guarantees for Autoencoders via Unsupervised Adversarial Attacks
Paper abstract: Autoencoders are an essential concept in unsupervised learning. Currently, the quality of autoencoders is assessed either internally (e.g. based on mean square error) or externally (e.g. by classification performance). Yet, there is no possibility to prove that autoencoders generalize beyond the finite training data, and hence, they are not reliable for safety-critical applications that require formal guarantees also for unseen data. To address this issue, we propose the first framework to bound the worst-case error of an autoencoder within a safety-critical region of an infinite value domain, as well as the definition of unsupervised adversarial examples that cause such worst-case errors. (…)

4. On Saliency Maps and Adversarial Robustness
Paper abstract: A very recent trend has emerged to couple the notion of interpretability and adversarial robustness, unlike earlier efforts which solely focused on good interpretations or robustness against adversaries. (…) In this work, we provide a different perspective to this coupling, and provide a method, Saliency based Adversarial training (SAT), to use saliency maps to improve adversarial robustness of a model. In particular, we show that using annotations such as bounding boxes and segmentation masks, already provided with a dataset, as weak saliency maps, suffices to improve adversarial robustness with no additional effort to generate the perturbations themselves. (…)
5. Scalable Backdoor Detection in Neural Networks
Paper abstract: Recently, it has been shown that deep learning models are vulnerable to Trojan attacks. In the Trojan attacks, an attacker can install a backdoor during training to make the model misidentify samples contaminated with a small trigger patch. Current backdoor detection methods fail to achieve good detection performance and are computationally expensive. In this paper, we propose a novel trigger reverse-engineering based approach whose computational complexity does not scale up with the number of labels and is based on a measure that is both interpretable and universal across different networks and patch types. (…)
Theory for deep learning
1. A³: Activation Anomaly Analysis
Paper abstract: Inspired by recent advances in coverage-guided analysis of neural networks, we propose a novel anomaly detection method. We show that the hidden activation values contain information useful to distinguish between normal and anomalous samples. Our approach combines three neural networks in a purely data-driven end-to-end model. Based on the activation values in the target network, the alarm network decides if the given sample is normal. Thanks to the anomaly network, our method even works in strict semi-supervised settings. (…)
Main authors:
Philip Sperl
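The abstract describes the data flow clearly enough for a small sketch: collect hidden activations of the pretrained target network (here via forward hooks) and feed them to a separate alarm network that scores the sample as normal or anomalous. This is a two-network toy version, not the full three-network model from the paper, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

# a pretrained "target" network whose hidden activations we inspect
target = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),
)

activations = []
def grab(module, inputs, output):
    activations.append(output.detach())

# collect the activation values of every ReLU layer via forward hooks
for layer in target:
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(grab)

# the "alarm" network decides from those activations whether a sample is normal
alarm = nn.Sequential(nn.Linear(64 + 32, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(4, 20)
activations.clear()
_ = target(x)                                   # fills `activations` through the hooks
features = torch.cat(activations, dim=1)        # concatenated hidden activations
anomaly_score = torch.sigmoid(alarm(features))  # higher = more anomalous (after training)
```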
2. Effective Version Space Reduction for Convolutional Neural Networks
Paper abstract: In active learning, sampling bias could pose a serious inconsistency problem and hinder the algorithm from finding the optimal hypothesis. However, many methods are hypothesis space agnostic and do not address this problem. We examine active learning with deep neural networks through the principled lens of version space reduction and check the realizability assumption. Based on their objectives, we identify the core differences between prior mass reduction and diameter reduction methods and propose a new diameter-based querying method – the Gibbs vote disagreement. (…)
Main authors:

Jiayu Liu
3. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication
Paper abstract: Deep reinforcement learning algorithms have recently been used to train multiple interacting agents in a centralised manner whilst keeping their execution decentralised. When the agents can only acquire partial observations and are faced with tasks requiring coordination and synchronisation skills, inter-agent communication plays an essential role. In this work, we propose a framework for multi-agent training using deep deterministic policy gradients that enables concurrent, end-to-end learning of an explicit communication protocol through a memory device. During training, the agents learn to perform read and write operations enabling them to infer a shared representation of the world. (…)
4. A Principle of Least Action for the Training of Neural Networks
Paper abstract: Neural networks have been achieving high generalization performance on many tasks despite being highly over-parameterized. Since classical statistical learning theory struggles to explain this behaviour, much effort has recently been focused on uncovering the mechanisms behind it, in the hope of developing a more adequate theoretical framework and having a better control over the trained models. In this work, we adopt an alternative perspective, viewing the neural network as a dynamical system displacing input particles over time. We conduct a series of experiments and, by analyzing the network’s behaviour through its displacements, we show the presence of a low kinetic energy bias in the transport map of the network, and link this bias with generalization performance. (…)

Computer vision / image processing
1. Companion Guided Soft Margin for Face Recognition
Paper abstract: Face recognition has achieved remarkable improvements with the help of the angular margin based softmax losses. However, the margin is usually manually set and kept constant during the training process, which neglects both the optimization difficulty and the informative similarity structures among different instances. (…) In this paper, we propose a novel sample-wise adaptive margin loss function from the perspective of the hypersphere manifold structure, which we call companion guided soft margin (CGSM). CGSM introduces the information of the distribution in the feature space, and conducts teacher-student optimization within each mini-batch. (…)
2. Soft Labels Transfer with Discriminative Representations Learning for Unsupervised Domain Adaptation
Paper abstract: Domain adaptation aims to address the challenge of transferring the knowledge obtained from the source domain with rich label information to the target domain with less or even no label information. Recent methods start to tackle this problem by incorporating the hard-pseudo labels for the target samples to better reduce the cross-domain distribution shifts. However, these approaches are vulnerable to error accumulation and hence unable to preserve cross-domain category consistency. (…) To address this issue, we propose a Soft Labels transfer with Discriminative Representations learning (SLDR) framework to jointly optimize the class-wise adaptation with soft target labels and learn the discriminative domain-invariant features in a unified model. (…)
First author:
Manliang Cao
3. Information-Bottleneck Approach to Salient Region Discovery
Paper abstract: We propose a new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle. Provided with a set of labeled images, the mask generation model is minimizing mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a Boolean rather than a continuous mask, entirely concealing the information in masked-out pixels. (…)
4. FAWA: Fast Adversarial Watermark Attack on Optical Character Recognition (OCR) Systems
Paper abstract: Optical character recognition (OCR) is widely applied in real applications, serving as a key preprocessing tool for tasks such as information extraction and sentiment analysis. The adoption of deep neural networks (DNNs) in OCR results in vulnerability to adversarial examples, which are crafted to mislead the output of the threat model. We propose the fast watermark adversarial attack (FAWA) against a white-box OCR model to produce natural distortion in the disguise of watermarks and evade human eyes’ detection. This paper is the first effort to bring normal adversarial perturbations and watermarks together in adversarial attacks and generate adversarial watermarks. (…)
First author:
Lu Chen
Optimization for deep learning
1. ADMMiRNN: Training RNN with Stable Convergence via An Efficient ADMM Approach
Paper abstract: It is hard to train Recurrent Neural Network (RNN) with stable convergence and avoid gradient vanishing and exploding, as the weights in the recurrent unit are repeated from iteration to iteration. Moreover, RNN is sensitive to the initialization of weights and bias, which brings difficulty in the training phase. With the gradient-free feature and immunity to poor conditions, the Alternating Direction Method of Multipliers (ADMM) has become a promising algorithm to train neural networks beyond traditional stochastic gradient algorithms. However, ADMM could not be applied to train RNN directly, since the state in the recurrent unit is repetitively updated over timesteps. Therefore, this work builds a new framework named ADMMiRNN upon the unfolded form of RNN to address the above challenges simultaneously and provides novel update rules and theoretical convergence analysis. (…)
First author:
Yu Tang
2. Exponential Convergence of Gradient Methods in Network Zero Sum Concave Games
Paper abstract: Motivated by Generative Adversarial Networks, we study the computation of Nash equilibrium in concave network zero sum games (NZSGs), a multiplayer generalization of two-player zero sum games first proposed with linear payoffs by Cai et al. Extending results by Cai et al., we show that various game theoretic properties of concave-convex two-player zero sum games are preserved in this generalization. We then generalize last iterate convergence results obtained previously in two-player zero sum games. (…)
3. Adaptive Momentum Coefficient for Neural Network Optimization
Paper abstract: We propose a novel and efficient momentum-based first-order algorithm for optimizing neural networks which uses an adaptive coefficient for the momentum term. Our algorithm, called Adaptive Momentum Coefficient (AMoC), utilizes the inner product of the gradient and the previous update to the parameters, to effectively control the amount of weight put on the momentum term based on the change of direction in the optimization path. The algorithm is easy to implement and its computational overhead over momentum methods is negligible. (…)
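One way to read this description is that the momentum coefficient is kept large when the previous update still agrees with the current descent direction (the negative gradient) and is damped when it opposes it, as measured by their inner product. A schematic numpy sketch of that reading; the cosine-based scaling and the constants are my assumptions, not the published update rule:

```python
import numpy as np

def amoc_style_step(params, grad, prev_update, lr=0.01, base_beta=0.9):
    """One step with a momentum coefficient adapted from <grad, prev_update>.

    If the previous update is aligned with the descent direction (-grad),
    more momentum is kept; if it points the opposite way, momentum is damped.
    """
    g, u = grad.ravel(), prev_update.ravel()
    denom = np.linalg.norm(g) * np.linalg.norm(u) + 1e-12
    alignment = -float(g @ u) / denom              # cosine between prev_update and -grad
    beta = base_beta * (1.0 + alignment) / 2.0     # adapted coefficient in [0, base_beta]
    update = beta * prev_update - lr * grad
    return params + update, update                 # new parameters and new "previous update"
```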

4. Squeezing Correlated Neurons for Resource-Efficient Deep Neural Networks
Paper abstract: DNNs are abundantly represented in real-life applications because of their accuracy in challenging problems, yet their demanding memory and computational costs challenge their applicability to resource-constrained environments. Taming computational costs has hitherto focused on first-order techniques, such as eliminating numerically insignificant neurons/filters through numerical contribution metric prioritizations, yielding passable improvements. Yet redundancy in DNNs extends well beyond the limits of numerical insignificance. (…) To this end, we employ practical data analysis techniques coupled with a novel feature elimination algorithm to identify a minimal set of computation units that capture the information content of the layer and squash the rest. (…)
Summary
That’s it!
I personally recommend also visiting the event website and exploring your favourite topics in greater depth.
Note that there's another post coming with the best applied data science papers, so stay tuned!
If you feel that there is something cool missing, simply let me know, and I will extend this post.