Last week I participated in the ECML-PKDD 2020 Conference. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases is one of the most recognized academic conferences on ML in Europe.
In the spirit of spreading the word about ML developments, I wanted to share my selection of the best “applied data science” papers from the conference. This is the second post in the series. The previous one, about top research papers, can be found here. Make sure to check it out as well.
Applied data science papers and presentations were a nice addition to the pure research papers. Thanks to this, the conference was balanced between research and industry topics.
In this post, papers are grouped into categories that follow the conference’s scheme.
1. Stop the Clock: Are Timeout Effects Real?
Paper abstract: Timeout is a short interruption during games used to communicate a change in strategy, to give the players a rest or to stop a negative flow in the game. (…) But, how effective are these timeouts in doing so? The simple average of the differences between the scores before and after the timeouts has been used as evidence that there is an effect and that it is substantial. We claim that these statistical averages are not proper evidence and a more sound approach is needed. We applied a formal causal framework using a large dataset of official NBA play-by-play tables and drew our assumptions about the data generation process in a causal graph. (…)
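To make the abstract's point about simple averages concrete, here is a small sketch of my own on made-up numbers (not NBA data, and not the paper's causal-graph method). It assumes a single hypothetical confounder: teams tend to call timeouts after a bad run, and runs tend to even out anyway. Comparing timeout moments with no-timeout moments that had the same prior run removes the apparent effect.

```python
from statistics import mean

# Hypothetical records: (called_timeout, run_before, run_after).
# run_* is the point differential over the possessions before/after the moment.
records = [
    (True, -6, 0), (True, -6, 1), (True, -4, 0), (True, -4, -1),
    (False, -6, 1), (False, -6, 0), (False, -4, -1), (False, -4, 0),
    (False, 2, 0), (False, 2, 1),
]

def naive_effect(rows):
    """Average of (after - before) over timeout moments only."""
    return mean(after - before for t, before, after in rows if t)

def stratified_effect(rows):
    """Compare 'after' for timeout vs. no-timeout moments with the same 'before'."""
    diffs = []
    for level in sorted({b for t, b, _ in rows if t}):
        treated = [a for t, b, a in rows if t and b == level]
        control = [a for t, b, a in rows if not t and b == level]
        if treated and control:  # skip strata without a comparison group
            diffs.append(mean(treated) - mean(control))
    return mean(diffs)

print(naive_effect(records), stratified_effect(records))
```

On this toy data the naive average suggests a five-point swing, while the stratified comparison shows no difference at all. Real play-by-play data would need many more adjustment variables than this one.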
2. Automatic Pass Annotation from Soccer Video Streams based on Object Detection and LSTM
Paper abstract: Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of data that describe all the spatio-temporal events that occur in each match. (…) In this paper, we describe PassNet, a method to recognize the most frequent events in soccer, i.e., passes, from video streams. Our model combines a set of artificial neural networks that perform feature extraction from video streams, object detection to identify the positions of the ball and the players, and classification of frame sequences as passes or not passes. (…)
3. SoccerMix: Representing Soccer Actions with Mixture Models
Paper abstract: Analyzing playing style is a recurring task within soccer analytics that plays a crucial role in club activities such as player scouting and match preparation. (…) Current techniques for analyzing playing style are often hindered by the sparsity of soccer event stream data (i.e., the same player rarely performs the same action in the same location more than once). This paper proposes SoccerMix, a soft clustering technique based on mixture models that enables a novel probabilistic representation for soccer actions. (…)
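For readers new to soft clustering, the sketch below shows the core idea: instead of assigning an action to a single cluster, a mixture model gives it a probability for every component. The two isotropic Gaussian components and their parameters are assumptions I made up for illustration; they are not taken from the paper.

```python
import math

# Hypothetical mixture over pitch locations (x, y in metres): (weight, mean, std).
components = [
    (0.5, (20.0, 34.0), 8.0),
    (0.5, (80.0, 34.0), 8.0),
]

def responsibilities(point, comps):
    """Posterior probability of each component for one action location."""
    x, y = point
    scores = []
    for weight, (mx, my), std in comps:
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        # Density of an isotropic 2D Gaussian at the point.
        density = math.exp(-sq_dist / (2 * std ** 2)) / (2 * math.pi * std ** 2)
        scores.append(weight * density)
    total = sum(scores)
    return [s / total for s in scores]

print(responsibilities((50.0, 34.0), components))  # midway between both components
print(responsibilities((25.0, 34.0), components))  # close to the first component
```

A player can then be described by summing these probability vectors over all of their actions, which sidesteps the sparsity problem of exact locations.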
4. SoccerMap: A Deep Learning Architecture for Visually-Interpretable Analysis in Soccer
Paper abstract: We present a fully convolutional neural network architecture that is capable of estimating full probability surfaces of potential passes in soccer, derived from high-frequency spatiotemporal data. The network receives layers of low-level inputs and learns a feature hierarchy that produces predictions at different sampling levels, capturing both coarse and fine spatial details. By merging these predictions, we can produce visually-rich probability surfaces for any game situation that allows coaches to develop a fine-grained analysis of players’ positioning and decision-making, an as yet little-explored area in sports. (…)
Hardware and manufacturing
1. Learning I/O Access patterns to Improve Prefetching in SSDs
Paper abstract: Flash based solid-state drives have established themselves as a higher-performance alternative to hard disk drives in cloud and mobile environments. Nevertheless, SSDs remain a performance bottleneck of computer systems due to their high I/O access latency. (…) In this paper, we discuss the challenges of prefetching in SSDs, explain why prior approaches fail to achieve high accuracy, and present a neural network based prefetching approach that significantly outperforms the state of the art. To achieve high performance, we address the challenges of prefetching in very large sparse address spaces, as well as prefetching in a timely manner by predicting ahead of time. (…)
2. FlowFrontNet: Improving Carbon Composite Manufacturing with CNNs
Paper abstract: Carbon fiber reinforced polymers (CFRP) are light yet strong composite materials designed to reduce the weight of aerospace or automotive components – contributing to reduced greenhouse gas emissions. Resin transfer molding (RTM) is a manufacturing process for CFRP that can be scaled up to industrial-sized production. (…) We propose FlowFrontNet, a deep learning approach to enhance the in-situ process perspective by learning a mapping from sensors to flow front “images” (using upscaling layers), to capture spatial irregularities in the flow front to predict dry spots (using convolutional layers). (…)
3. On-Site Gamma-Hadron Separation with Deep Learning on FPGAs
Paper abstract: Modern high-energy astroparticle experiments produce large amounts of data every day in continuous high-volume streams. (…) The separation of gamma rays from background noise, which is inevitably recorded, is called the Gamma-Hadron separation problem. Current solutions heavily rely on hand-crafted features. (…) The overall machine learning pipeline is executed on commodity computer hardware after an event has occurred. In this paper, we propose an alternative approach which applies Convolutional Neural Networks (CNNs) and Binary Neural Networks (BNNs) directly to the raw feature stream of the telescope’s camera. (…)
4. Interpretable dimensionally-consistent feature extraction from electrical network sensors
Paper abstract: Electrical power networks are heavily monitored systems, requiring operators to perform intricate information synthesis before understanding the underlying network state. Our study aims at helping this synthesis step by automatically creating features from the sensor data. We propose a supervised feature extraction approach using a grammar-guided evolution, which outputs interpretable and dimensionally-consistent features. (…)
1. A Multi-Criteria System for Recommending Taxi Routes with an Advance Reservation
Paper abstract: As the demand of taxi reservation services has increased, the strategies of how to increase the income of taxi drivers with advanced service have attracted attention. However, the demand is usually unmet due to the imbalance of profit. In this paper, we propose a multi-criteria route recommendation framework that considers real-time spatial-temporal predictions and traffic network information, aiming to optimize a taxi driver’s profit when the driver has an advance reservation. (…)
2. Autonomous Driving Validation With Model-Based Dictionary Clustering
Paper abstract: The validation of autonomous driving systems remains one of the biggest challenges that car manufacturers must tackle in order to provide secure driverless cars. (…) In this paper, we present a new method applied to time-series produced by autonomous driving numerical simulations. It is a dictionary-based method that consists of three steps: automatic segmentation of each time-series, regime dictionary construction, and clustering of produced categorical sequences. We present the time-series specific structure and the proposed method advantages for processing such data, compared to state-of-the-art reference methods.
3. Automation of Leasing Vehicle Return Assessment Using Deep Learning Models
Paper abstract: The assessment of damages includes classifying damage and estimating its repair cost and is an essential process in vehicle leasing and insurance industries. Particularly in vehicle leasing, the end-of-lease assessment contributes heavily towards the actual cost the customer has to pay. (…) In this paper, we present a machine learning (ML) based automated solution for the leasing vehicle return assessment. Furthermore, we highlight how the standard ML models and their training protocols fail when dealing with a dataset that has been collected without a standard procedure. (…)
4. Real-time Lane Configuration with Coordinated Reinforcement Learning
Paper abstract: Changing lane configuration of roads, based on traffic patterns, is a proven solution for improving traffic throughput. Traditional lane-direction configuration solutions assume pre-known traffic patterns, hence are not suitable for real-world applications as they are not able to adapt to changing traffic conditions. We propose a dynamic lane configuration solution for improving traffic flow using a two-layer, multi-agent architecture, named Coordinated Learning-based Lane Allocation (CLLA). (…)
5. Learning a Contextual and Topological Representation of Areas-of-Interest for On-Demand Delivery Application
Paper abstract: A good representation of urban areas is of great importance in on-demand delivery services such as for ETA prediction. However, the existing representations learn either from sparse check-in histories or topological geometries, thus are either lacking coverage and violating the geographical law or ignoring contextual information from data. In this paper, we propose a novel representation learning framework for getting a unified representation of Area of Interest from both contextual data (trajectories) and topological data (graphs). (…)
1. Explaining end-to-end ECG automated diagnosis using contextual features
Paper abstract: We propose a new method to generate explanations for end-to-end classification models. The explanations consist of meaningful features to the user, namely contextual features. We instantiate our approach in the scenario of automated electrocardiogram (ECG) diagnosis and analyze the explanations generated in terms of interpretability and robustness. The proposed method uses a noise-insertion strategy to quantify the impact of intervals and segments of the ECG signals on the automated classification outcome. (…)
2. Self-Supervised Log Parsing
Paper abstract: Logs are extensively used during the development and maintenance of software systems. (…) However, large-scale software systems generate massive volumes of semi-structured log records, posing a major challenge for automated analysis. Parsing semi-structured records with free-form text log messages into structured templates is the first and crucial step that enables further analysis. Existing approaches rely on log-specific heuristics or manual rule extraction. (…) We propose a novel parsing technique called NuLog that utilizes a self-supervised learning model and formulates the parsing task as masked language modeling (MLM). (…)
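To show what "formulating parsing as masked language modeling" could look like on the data side, here is a minimal sketch that turns one log line into training examples by hiding one token at a time. The whitespace tokenizer, the `<MASK>` symbol, and the function name are my own assumptions, not NuLog's actual preprocessing.

```python
MASK = "<MASK>"  # hypothetical mask symbol

def masked_examples(log_line):
    """Yield (masked_tokens, position, target) with one token hidden at a time."""
    tokens = log_line.split()
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        yield masked, i, target

for example in masked_examples("Deleting block blk_42"):
    print(example)
```

The intuition, as I understand it, is that tokens a trained model can reliably recover from context belong to the constant template, while tokens it cannot predict are likely variable parameters.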
3. A context-based approach to detect abnormal human behaviors in ambient intelligent systems
Paper abstract: Abnormal human behaviors can be signs of a health issue or the occurrence of a hazardous incident. Detecting such behaviors is essential in Ambient Intelligent (AmI) systems to enhance the safety of people. (…) In this paper, a novel approach is proposed to detect such behaviors exploiting the contextual information of human behaviors. (…)
4. Forecasting Error Pattern-based Anomaly Detection in Multivariate Time Series
Paper abstract: The advent of Industry 4.0, partly characterized by the development of cyber-physical systems (CPSs), naturally entails the need for reliable security schemes. (…) In this work, we aim to contribute to the body of literature on the application of anomaly detection techniques in CPSs. We propose novel Functional Data Analysis (FDA) and Autoencoder-based approaches for anomaly detection in the Secure Water Treatment (SWaT) dataset, which realistically represents a scaled-down industrial water treatment plant. (…)
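The general idea behind forecasting-error-based detection can be sketched in a few lines: forecast the next value, measure the error, and flag points whose error is far above what was seen on normal data. The sketch below swaps in a naive last-value forecaster and a mean-plus-k-standard-deviations threshold; both are stand-in assumptions, since the paper uses FDA and autoencoder-based models.

```python
from statistics import mean, stdev

def forecast_error_anomalies(series, train_size, k=3.0):
    """Flag indices whose one-step forecast error is unusually large.

    The forecaster is a naive last-value predictor (an assumption for this sketch).
    The first `train_size` errors are treated as normal behaviour.
    """
    errors = [abs(series[i] - series[i - 1]) for i in range(1, len(series))]
    baseline = errors[:train_size]
    threshold = mean(baseline) + k * stdev(baseline)
    # errors[i] belongs to series index i + 1.
    return [i + 1 for i, e in enumerate(errors) if i >= train_size and e > threshold]

readings = [10, 11, 10, 12, 10, 11, 10, 25, 11, 10]
print(forecast_error_anomalies(readings, train_size=6))
```

Note that this flags both the jump into the spike and the jump back out of it, because the naive forecaster is wrong at both points.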
5. Recognizing Complex Activities by a Temporal Causal Network-Based Model
Paper abstract: Complex activity recognition is challenging due to the inherent diversity and causality of performing a complex activity, with each of its instances having its own configuration of primitive events and their temporal causal dependencies. (…) Our approach introduces a temporal causal network generated from an optimized network skeleton to explicitly characterize these unique temporal causal configurations of a particular complex activity as a variable number of nodes and links. (…)
1. Think out of the package: Recommending package types for e-commerce shipments
Paper abstract: Multiple product attributes like dimensions, weight, fragility, liquid content etc. determine the package type used by e-commerce companies to ship products. (…) In this work, we propose a multi-stage approach that trades-off between shipment and damage costs for each product, and accurately assigns the optimal package type using a scalable, computationally efficient linear time algorithm. A simple binary search algorithm is presented to find the hyper-parameter that balances between the shipment and damage costs. (…)
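Here is a rough sketch of how a binary search over a single trade-off parameter might work. Everything in it is an assumption for illustration: the cost table, the per-product rule "minimize shipment cost plus lam times damage cost", and the stopping criterion of meeting a total damage budget. The paper may define the balance differently.

```python
# Hypothetical per-product options: {package_type: (shipment_cost, damage_cost)}.
products = [
    {"envelope": (1.0, 4.0), "box": (3.0, 0.5)},
    {"envelope": (1.0, 1.0), "box": (3.0, 0.2)},
    {"envelope": (1.0, 0.3), "box": (3.0, 0.1)},
]

def assign(products, lam):
    """Pick, per product, the package minimizing shipment + lam * damage."""
    return [min(opts, key=lambda p: opts[p][0] + lam * opts[p][1]) for opts in products]

def total_damage(products, choice):
    """Sum the damage cost of the chosen package for every product."""
    return sum(opts[p][1] for opts, p in zip(products, choice))

def search_lambda(products, damage_budget, lo=0.0, hi=100.0, tol=1e-6):
    """Binary search for the smallest lam whose assignment meets the damage budget.

    Assumes the budget is already met at the initial `hi`.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total_damage(products, assign(products, mid)) <= damage_budget:
            hi = mid
        else:
            lo = mid
    return hi

lam = search_lambda(products, damage_budget=1.5)
print(lam, assign(products, lam))
```

The search works because total damage can only go down as `lam` grows, so the feasible values of `lam` form a single interval. Each per-product choice is independent, which keeps one pass over the catalogue linear in the number of products.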
2. Recommending Courses in MOOCs for Jobs: An Auto Weak Supervision Approach
Paper abstract: The proliferation of massive open online courses (MOOCs) demands an effective way of course recommendation for jobs posted in recruitment websites, especially for the people who take MOOCs to find new jobs. (…) This paper proposes a general automated weak supervision framework AutoWeakS via reinforcement learning to solve the problem. On the one hand, the framework enables training multiple supervised ranking models upon the pseudo labels produced by multiple unsupervised ranking models. On the other hand, the framework enables automatically searching the optimal combination of these supervised and unsupervised models. (…)
3. Feedback-guided Attributed Graph Embedding for Relevant Video Recommendation
Paper abstract: Representation learning on graphs, as alternatives to traditional feature engineering, has been exploited in many application domains, ranging from e-commerce to computational biology. (…) In this paper, we present a video embedding approach named Equuleus, which learns video embeddings from user interaction behaviors. In Equuleus, we carefully incorporate user behavior characteristics into the construction of the video graph and the generation of node sequences. (…)
4. Social Influence Attentive Neural Network for Friend-Enhanced Recommendation
Paper abstract: With the thriving of online social networks, there emerges a new recommendation scenario in many social apps, called Friend-Enhanced Recommendation (FER) (…). In FER, a user is recommended with items liked/shared by their friends (called a friend referral circle). (…) In this paper, we first formulate the FER problem, and propose a novel Social Influence Attentive Neural network (SIAN) solution. In order to fuse rich heterogeneous information, the attentive feature aggregator in SIAN is designed to learn user and item representations at both node- and type-levels. (…)
1. Calibrating user response predictions in online advertising
Paper abstract: Predicting user response probability such as click-through rate (CTR) and conversion rate (CVR) accurately is essential to online advertising systems. (…) Due to the sparsity and latency of the user response behaviors such as clicks and conversions, traditional calibration methods may not work well in real-world online advertising systems. In this paper, we present a comprehensive calibration solution for online advertising. More specifically, we propose a calibration algorithm to exploit implicit properties of predicted probabilities to reduce negative impacts of the data sparsity problem. (…)
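As background, the sketch below shows histogram binning, one of the traditional calibration methods the abstract alludes to (it is not the paper's algorithm). Predictions are grouped into equal-width bins and each bin is mapped to its observed positive rate. The function names and the toy data are my own.

```python
def fit_bins(preds, labels, n_bins=5):
    """Observed positive rate per equal-width bin of predicted probability."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(preds, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        sums[b] += y
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def calibrate(p, bin_rates):
    """Map a raw score to its bin's observed rate, or keep it if the bin is empty."""
    b = min(int(p * len(bin_rates)), len(bin_rates) - 1)
    rate = bin_rates[b]
    return p if rate is None else rate

preds = [0.1, 0.15, 0.12, 0.18, 0.9, 0.95]
labels = [0, 0, 0, 1, 1, 0]
rates = fit_bins(preds, labels)
print(rates, calibrate(0.17, rates), calibrate(0.45, rates))
```

The empty middle bins in this toy run hint at the sparsity problem: with few clicks or conversions, many bins have little or no data to estimate a rate from.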
2. 6VecLM: Language Modeling in Vector Space for IPv6 Target Generation
Paper abstract: Fast IPv6 scanning is challenging in the field of network measurement as it requires exploring the whole IPv6 address space but limited by current computational power. (…) In this paper, we introduce our approach 6VecLM to explore achieving such target generation algorithms. The architecture can map addresses into a vector space to interpret semantic relationships and uses a Transformer network to build IPv6 language models for predicting address sequence. (…)
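Before addresses can be fed to a language model, they have to become token sequences. A minimal sketch of one way to do that, using Python's standard `ipaddress` module, is shown below. Tagging every hex digit with its position is my own illustrative encoding; the exact word format used by 6VecLM may differ.

```python
import ipaddress

def address_to_words(addr):
    """Turn an IPv6 address into 32 'words', one per hex digit, tagged with its position."""
    # .exploded expands "::" and leading zeros to the full 8 x 4 digit form.
    nybbles = ipaddress.IPv6Address(addr).exploded.replace(":", "")
    return [f"{position:02d}:{digit}" for position, digit in enumerate(nybbles)]

words = address_to_words("2001:db8::1")
print(words[:8], words[-1])
```

Keeping the position in each token lets a model distinguish, say, a `0` in the network prefix from a `0` in the interface identifier.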
3. Estimating Precisions for Multiple Binary Classifiers Under Limited Samples
Paper abstract: Machine learning classifiers often require regular tracking of their performance measures such as precision, recall, etc., for model improvement and diagnostics. (…) We propose a sampling method to estimate the precisions of multiple binary classifiers that exploits the overlaps between their prediction sets. We provide theoretical guarantees that our estimators are unbiased and empirically demonstrate that the precision metrics estimated from our sampling technique are as good (in terms of variance and confidence interval) as those obtained from a uniform random sample. (…)
4. Neural User Embedding From Browsing Events
Paper abstract: The deep understanding of online users on the basis of their behavior data is critical to providing personalized services to them. However, the existing methods for learning user representations are usually based on supervised frameworks such as demographic prediction and product recommendation. (…). Motivated by the success of pretrained word embeddings in many natural language processing (NLP) tasks, we propose a simple but effective neural user-embedding approach to learn the deep representations of online users by using their unlabeled behavior data. Once the users are encoded to low-dimensional dense embedding vectors, these hidden user vectors can be used as additional user features in various user-involved tasks, such as demographic prediction, to enrich user representation. (…)
Computational social science
1. Spatial Community-Informed Evolving Graphs for Demand Prediction
Paper abstract: The rapidly increasing number of sharing bikes has facilitated people’s daily commuting significantly. However, the number of available bikes in different stations may be imbalanced due to the free check-in and check-out of users. (…) To tackle these challenges, we propose a novel Spatial Community-informed Evolving Graphs (SCEG) framework to predict station-level demands, which considers two different grained interactions. Specifically, we learn time-evolving representation from fine-grained interactions in evolving station networks using EvolveGCN. (…)
2. A Deep Dive into Multilingual Hate Speech Classification
Paper abstract: Hate speech is a serious issue that is currently plaguing the society and has been responsible for severe incidents such as the genocide of the Rohingya community in Myanmar. Social media has allowed people to spread such hateful content even faster. This is especially concerning for countries which lack hate speech detection systems. In this paper, using hate speech dataset in 9 languages from 16 different sources, we perform the first extensive evaluation of multilingual hate speech detection. We analyze the performance of different deep learning models in various scenarios. (…)
3. Semi-Supervised Multi-aspect Detection of Misinformation using Hierarchical Joint Decomposition
Paper abstract: Distinguishing between misinformation and real information is one of the most challenging problems in today’s interconnected world. The vast majority of the state-of-the-art in detecting misinformation is fully supervised, requiring a large number of high-quality human annotations. (…) In this work, we are interested in exploring scenarios where the number of annotations is limited. In such scenarios, we investigate how tapping on a diverse number of resources that characterize a news article, henceforth referred to as “aspects” can compensate for the lack of labels. (…)
4. Model Bridging: Connection between Simulation Model and Neural Network
Paper abstract: The interpretability of machine learning, particularly for deep neural networks, is crucial for decision making in real-world applications. One approach is replacing un-interpretable machine learning model with a surrogate model, which has a simple structure for interpretation. Another approach is understanding the target system by using a simulation modeled by human knowledge with interpretable simulation parameters. (…) Our idea is to use a simulation model as an interpretable surrogate model. However, the computational cost of simulator calibration is high owing to the complexity of the simulation model. Thus, we propose a “model-bridging” framework to bridge machine learning models with simulation models by a series of kernel mean embeddings to address these difficulties. (…)
E-commerce and finance
1. Fashion Outfit Generation for E-commerce
Paper abstract: The task of combining complimentary pieces of clothing into an outfit is familiar to most people, but has thus far proved difficult to automate. We present a model that uses multimodal embeddings of pieces of clothing based on images and textual descriptions. The embeddings and a shared style space are trained end to end in a novel deep neural network architecture. The network is trained on the largest and richest labelled outfit dataset made available to date, which we opensource. (…)
2. Measuring Immigrants Adoption of Natives Shopping Consumption with Machine Learning
Paper abstract: “Tell me what you eat and I will tell you what you are”. Jean Anthelme Brillat-Savarin was among the firsts to recognize the relationship between identity and food consumption. Food adoption choices are much less exposed to external judgment and social pressure than other individual behaviours, and can be observed over a long period. That makes them an interesting basis for, among other applications, studying the integration of immigrants from a food consumption viewpoint. Indeed, in this work we analyze immigrants’ food consumption from shopping retail data for understanding if and how it converges towards those of natives. (…)
3. Why did my Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data
Paper abstract: Transaction analysis is an important part in studies aiming to understand consumer behaviour. (…) In this paper we propose a new distance metric that is retailer independent by design, allowing cross-retailer and cross-country analysis. The metric comes with a novel method of finding the importance of categories of products, alternating between unsupervised learning techniques and importance calibration. (…)
4. A 3D-Advert Creation System for Product Placements
Paper abstract: Over the past decade, the evolution of video-sharing platforms has attracted a significant amount of investments on contextual advertising. The common contextual advertising platforms utilize the information provided by users to integrate 2D visual ads in videos. (…) This paper presents a Video Advertisement Placement & Integration (Adverts) framework, which is capable of perceiving the 3D geometry of the scene and camera motion to blend 3D virtual objects in videos and create the illusion of reality. (…)
5. Detecting and predicting evidences of insider trading in the Brazilian market
Paper abstract: Insider trading is known to negatively impact market risk and is considered a crime in many countries. The rate of enforcement however varies greatly. In Brazil especially very few legal cases have been pursued and a dataset of previous cases is, to the best of our knowledge, nonexistent. In this work, we consider the Brazilian market and deal with two problems. Firstly we propose a methodology for creating a dataset of evidences of insider trading. (…) Secondly, we use our dataset in an attempt to recognise suspicious negotiations before relevant events are disclosed. (…)
1. Energy consumption forecasting using a stacked nonparametric Bayesian approach
Paper abstract: The process of forecasting household energy consumption is studied within the framework of the nonparametric Gaussian Process (GP), using multiple short time series data. As we begin to use smart meter data to paint a clearer picture of Australian residential electricity use, it becomes increasingly apparent that we must also construct a detailed picture and understanding of Australia’s complex relationship with gas consumption. (…) Considering these facts, we construct a stacked GP model where the predictive posteriors of each GP applied to each task are used in the prior and likelihood of the next level GP. We apply our model to a real world dataset to forecast energy consumption in Australian households across several states. (…)
2. Long-term pipeline failure prediction using nonparametric survival analysis
Paper abstract: Australian water infrastructure is more than a hundred years old, thus has begun to show its age through water main failures. Our work concerns approximately half a million pipelines across major Australian cities that deliver water to houses and businesses, serving over five million customers. (…) We applied Machine Learning techniques to find a cost-effective solution to the pipe failure problem in these Australian cities, where on average 1500 of water main failures occur each year. To achieve this objective, we construct a detailed picture and understanding of the behaviour of the water pipe network (…).
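For readers unfamiliar with nonparametric survival analysis, the Kaplan-Meier estimator is the textbook starting point, so here is a small sketch of it on made-up pipe data. I am not claiming this is the estimator used in the paper; the durations and failure flags are hypothetical. Pipes that have not failed yet are "censored": they still count as at risk up to their current age.

```python
def kaplan_meier(durations, failed):
    """Return [(time, survival_probability)] at each observed failure time."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    events = sorted(zip(durations, failed))
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = removed = 0
        # Group all pipes observed at the same age t.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]  # True counts as 1, censored (False) as 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

ages = [10, 20, 20, 30, 40]                 # years in service (hypothetical)
failed = [True, True, False, True, False]   # False = still working when observed
print(kaplan_meier(ages, failed))
```

The resulting step function estimates the probability that a pipe survives beyond each age without assuming any particular failure distribution.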
3. Lagrangian Duality for Constrained Deep Learning
Paper abstract: This paper explores the potential of Lagrangian duality for learning applications that feature complex constraints. Such constraints arise in many science and engineering applications, where the task amounts to learning optimization problems which must be solved repeatedly and include hard physical and operational constraints. The paper also considers applications where the learning task must enforce constraints on the predictor itself, either because they are natural properties of the function to learn or because it is desirable from a societal standpoint to impose these constraints. (…)
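As a reminder of how Lagrangian duality handles a constraint, here is a toy sketch of dual ascent on a one-variable problem of my own choosing: minimize (x - 2)^2 subject to x <= 1. It is only meant to show the alternating pattern. In a deep learning setting the primal step would be gradient-based training of the network rather than a closed-form minimization.

```python
def dual_ascent(steps=200, rho=0.5):
    """Minimize (x - 2)^2 subject to x <= 1 via Lagrangian dual ascent."""
    lam = 0.0  # Lagrange multiplier for the constraint x - 1 <= 0
    x = 2.0
    for _ in range(steps):
        # Primal step: minimize (x - 2)^2 + lam * (x - 1) over x (closed form).
        x = 2.0 - lam / 2.0
        # Dual step: raise the multiplier while the constraint is violated,
        # keeping it non-negative.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam

x, lam = dual_ascent()
print(x, lam)
```

The multiplier grows until the unconstrained minimizer of the Lagrangian sits exactly on the constraint boundary, here x = 1 with a multiplier of 2.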
4. CrimeForecaster: Crime Prediction by Exploiting the Neighborhoods’ Spatiotemporal Dependencies
Paper abstract: Crime prediction in urban areas can improve the allocation of resources (e.g., police patrols) towards a safer society. Recently, researchers have been using deep learning frameworks for urban crime forecasting with better accuracies as compared to previous work. (…) In this paper, we design and implement an end-to-end spatiotemporal deep learning framework, dubbed CrimeForecaster, which captures both the temporal recurrence and the spatial dependency simultaneously within and across regions. (…)
I also recommend visiting the event website and exploring your favourite topics in greater depth.
Note that we also published posts on top research papers from the conference. Have a look here.
This list is my subjective selection, and I would be happy to extend it. Feel free to suggest more papers: if you think something cool is missing, let me know and I will add it to this post.