
A Quickstart Guide to Auto-Sklearn (AutoML) For Machine Learning Practitioners

Using AutoML frameworks in the real world is becoming a regular thing for machine learning practitioners. People often ask: does automated machine learning (AutoML) replace data scientists? 

Not really. If you’re eager to find out what AutoML is and how it works, join me in this article. I’m going to show you auto-sklearn, a state-of-the-art and open-source AutoML framework.

To do this, I had to do some research:

  • Read the first and second papers, for auto-sklearn V1 and V2.
  • Took a deep dive into the auto-sklearn documentation and examples.
  • Checked the official auto-sklearn blog post.
  • Did some experiments on my own.

I do AutoML research myself, and I’ve learned quite a lot so far. After reading this post, you’ll know more about:

  • What is AutoML, and who is AutoML for?
  • Why does auto-sklearn matter to the ML community?
  • How to use auto-sklearn in practice?
  • What are the main features of auto-sklearn?
  • A use-case of auto-sklearn with result tracking in Neptune.

Let’s start!

Automated Machine Learning

AutoML is a young field. The AutoML community wants to build an automated workflow that can take raw data as input and produce predictions automatically.

This automated workflow should automatically do preprocessing, model selection, hyperparameter tuning, and all other stages of the ML process. For example, take a look at the image below to see how Microsoft Azure uses AutoML.

AutoML can improve the quality of data scientists’ work, but it’s not going to remove data scientists from the loop.

Experts could use AutoML to increase their job performance by focusing on the best-performing pipelines, and non-experts could use AutoML systems without a broad ML education. If you have 15 minutes to spare, the conversation below might help you understand what AutoML is all about.

What is AutoML: A conversation between Josh Starmer and Ioannis Tsamardinos

AutoML frameworks

There are different AutoML frameworks, each with unique features. Each of them automates a few steps of the full machine learning workflow, from pre-processing to model development. In this table, I summed up a few of them that are worth mentioning:

Name of AutoML framework | Created by | Documentation / GitHub | Open-source
auto-sklearn | Matthias Feurer et al. | GitHub / Documentation | Yes
Auto-xgboost | Janek Thomas et al. | GitHub | Yes
GCP-Tables | Google | Source | No
AutoGluon | Amazon | GitHub / Source | Yes
AutoML-azure | Microsoft | Source | No
GAMA | Pieter Gijsbers et al. | GitHub / Source | Yes
Auto-WEKA | Chris Thornton et al. | GitHub / Documentation | Yes
H2O AutoML | h2o.ai | GitHub | Yes
TPOT | Randal S. Olson et al. | GitHub / Documentation | Yes
ML-Plan | Marcel Wever et al. | GitHub / Documentation | Yes
hyperopt-sklearn | Brent Komer et al. | GitHub / Documentation | Yes
SmartML | Mohamed Maher et al. | GitHub | Yes
MLJAR | MLJAR Team | GitHub / Documentation | Yes

Auto-sklearn

auto-sklearn is an AutoML framework built on top of scikit-learn. It’s state-of-the-art and open-source.

auto-sklearn combines powerful methods and techniques which helped its creators win the first and second international AutoML challenges.

auto-sklearn is based on defining AutoML as a CASH problem. 

CASH stands for Combined Algorithm Selection and Hyperparameter optimization. Put simply, we want to find the best ML model and its hyperparameters for a dataset by searching a vast space that contains plenty of classifiers and many hyperparameters. In the figure below, you can see a representation of auto-sklearn provided by its authors.

AutoML system (image from the auto-sklearn paper)
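Formally, following the definition used in the auto-sklearn paper, the CASH problem looks for the algorithm A* and hyperparameter configuration λ* that minimize the cross-validation loss:

\[
A^{\star}, \lambda^{\star} \in \operatorname*{argmin}_{A \in \mathcal{A},\ \lambda \in \Lambda_A} \ \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}\!\left(A_{\lambda},\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\right)
\]

where A ranges over the available algorithms, Λ_A is the hyperparameter space of algorithm A, and the loss L is averaged over K cross-validation folds.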

auto-sklearn can solve classification and regression problems. The first version of auto-sklearn was introduced in the paper “Efficient and Robust Automated Machine Learning”, presented in 2015 at the 28th International Conference on Neural Information Processing Systems. The second version was presented in the 2020 paper “auto-sklearn 2.0: The Next Generation”.

Auto-sklearn features

What can auto-sklearn do for users? It has several valuable features, helpful for both novices and experts. 

By writing just five lines of Python code, beginners can get predictions, and experts can boost their productivity (see the sketch right after the list below). Here are the main features of auto-sklearn:

  • Written in Python, on top of the most popular ML library (scikit-learn).
  • Useful for many tasks, such as classification, regression, and multi-label classification.
  • Includes several preprocessing methods (handling missing values, normalizing data, and more).
  • Searches for optimal ML pipelines in a considerable search space (15 classifiers and more than 150 hyperparameters).
  • State-of-the-art performance thanks to meta-learning, Bayesian optimization, and ensemble techniques.
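
To make the “five lines of Python code” point above concrete, here is a minimal classification sketch; it assumes X_train, y_train, and X_test have already been prepared:

import autosklearn.classification

# search for a good pipeline with default settings, then predict
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)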

How does auto-sklearn work?

Auto-sklearn can solve classification and regression problems, but how? There’s a lot that goes into a machine learning pipeline. In general, auto-sklearn V1 has three main components:

  1. Meta-learning
  2. Bayesian optimization
  3. Ensemble building

When we want to run classification or regression on a new dataset, auto-sklearn starts by extracting its meta-features and, relying on meta-learning, measures how similar the new dataset is to the datasets in its knowledge base.

In the next step, once meta-learning has shrunk the search space enough, Bayesian optimization tries to find and select the best-performing ML pipelines. In the last step, auto-sklearn builds an ensemble model from the best pipelines found during the search.
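
These three stages map onto constructor arguments that you will meet again in the parameters table below. A hedged sketch (the values shown are just the defaults, not recommendations):

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    initial_configurations_via_metalearning=25,  # 1) meta-learning warm-start (V1 only)
    time_left_for_this_task=3600,                # 2) overall budget in seconds for Bayesian optimization
    ensemble_size=50,                            # 3) how many models are combined into the final ensemble
)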

Auto-sklearn v2: the new generation

Recently, the second version of auto-sklearn went public. Let’s review what’s changed in the new generation. Based on the official blog post and the original paper, there are four improvements:

  • Early stopping: each ML pipeline can now use an early-stopping strategy across the whole search space; this feature improved performance on large datasets, although it’s mostly useful for tree-based classifiers.
  • Improved model selection strategy: one vital step in auto-sklearn is how models are selected. In auto-sklearn V2, the authors used multi-fidelity optimization methods such as BOHB. However, they showed that a single model selection strategy doesn’t fit all types of problems, so they integrated several strategies. To get familiar with these optimization methods, you can read this article: “HyperBand and BOHB: Understanding State of the Art Hyperparameter Optimization Algorithms.”
  • Building a portfolio of configurations instead of using meta-features to find similar datasets in the knowledge base. You can see this improvement in the image below.
  • Building automated policy selection on top of the previous improvements, so the best strategy is chosen for each dataset.

Auto-sklearn main parameters

Although auto-sklearn might be able to find a well-performing pipeline without you setting any parameters, there are some parameters you can use to boost your productivity. To check all parameters, visit the official documentation.

Parameter name | Default value | Description
load_models | True | Whether to load the models after fitting.
time_left_for_this_task | 3600 | Time limit in seconds for the whole search. Increasing it also increases the chance of better performance.
per_run_time_limit | None | Time limit in seconds for each individual ML model.
initial_configurations_via_metalearning | 25 | How many configurations suggested by meta-learning are used to warm-start hyperparameter optimization. Set it to 0 to disable this option. Not available in auto-sklearn V2.
ensemble_size | 50 | The number of models in the ensemble. To disable ensembling, set it to 1.
n_jobs | 1 | The number of parallel jobs. Set it to -1 to use all processors.
ensemble_nbest | 50 | The number of best models considered when building the ensemble. Only relevant when ensemble_size is greater than 1.
include_estimators | None | If None, all estimators are used. Not available in auto-sklearn V2.
exclude_estimators | None | Estimators to exclude from the search space. Not available in auto-sklearn V2.
metric | None | If you don’t define a metric, it’s selected based on the task. In this article, we set it to autosklearn.metrics.roc_auc.
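
As a quick illustration of how a few of these parameters combine, here is a hedged sketch; the chosen values and the excluded estimator name are just examples, not recommendations:

import autosklearn.classification
import autosklearn.metrics

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,                 # total search budget in seconds
    per_run_time_limit=60,                       # budget per individual pipeline
    exclude_estimators=["k_nearest_neighbors"],  # drop an estimator from the search space
    ensemble_nbest=10,                           # build the ensemble from the 10 best models
    n_jobs=-1,                                   # use all processors
    metric=autosklearn.metrics.roc_auc,          # optimize ROC AUC instead of the default metric
)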

Now let’s apply what we learned in a case-study, and perform some experiments!

Track Auto-sklearn experiments on Neptune

I made some notebooks which you can easily download and use to run the experiments on your own. To follow all the steps again, check the project in Neptune:

AutoML Neptune

Check all the experiments in Neptune

First, you need to install auto-sklearn on your machine. Simply use pip3 for this:

pip3 install auto-sklearn

If you get an error, you may need to install some dependencies first, so please check the official installation page. You can also use the notebooks I prepared for you in Neptune. Then run the following commands to make sure the installation went correctly:

import autosklearn
print(autosklearn.__version__)
#  0.12.1

Let’s tackle some classification and regression problems.

Auto-sklearn for classification

For the classification problem, I chose a well-known Kaggle competition: Santander Customer Transaction Prediction. Please download the dataset and randomly select 10,000 records.
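
Here is a minimal sketch of that sampling step, assuming the raw Kaggle training file was downloaded as train.csv (the file name is an assumption):

import pandas as pd

# keep a random sample of 10,000 rows so the experiments run faster
train_full = pd.read_csv("train.csv")
sample = train_full.sample(n=10000, random_state=42)
sample.to_csv("sample_train_Santander.csv", index=False)

Then follow the experiments in the first notebook: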

# load and split the dataset into training and validation folds
import pandas as pd
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

train = pd.read_csv("./sample_train_Santander.csv")
X = train.drop(["ID_code", "target"], axis=1)
y = train["target"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=42
)

# define the model
automl = autosklearn.classification.AutoSklearnClassifier()

# train the model
automl.fit(X_train, y_train)

# predict class probabilities on the validation fold
y_pred = automl.predict_proba(X_val)

# score with ROC AUC
score = roc_auc_score(y_val, y_pred[:, 1])
print(score)

# show all models found during the search
show_models_str = automl.show_models()
sprint_statistics_str = automl.sprint_statistics()
print(show_models_str)
print(sprint_statistics_str)

We also need to define some configurations to gain more insight into auto-sklearn:

Configuration | Range / value | Description
time_left_for_this_task | 60 – 5000 | I started the experiments with 60 seconds and increased it up to 5000 seconds across experiments.
metric | roc_auc | As the case study is highly imbalanced, I changed the metric to roc_auc.
resampling_strategy | cv | In auto-sklearn V1, I could not get a good result without defining resampling_strategy; auto-sklearn V2 handles this automatically.
resampling_strategy_arguments | {‘folds’: 5} | The number of cross-validation folds.

To use the above configuration, you could define the automl object as follows:

# define the model
TIME_BUDGET = 60
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=TIME_BUDGET,
    metric=autosklearn.metrics.roc_auc,
    n_jobs=-1,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)

# train the model
automl.fit(X_train, y_train)

As I used plenty of different configurations, I simply tracked them in Neptune. You can see one of them in the image below, and check all of them in Neptune.

AutoML Neptune
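
For reference, logging a single run could look roughly like this; it’s a minimal sketch assuming the legacy neptune-client API, and the project name is just a placeholder:

import neptune

# connect to a Neptune project (placeholder name) and log one run
neptune.init(project_qualified_name="your-workspace/autosklearn-experiments")
neptune.create_experiment(
    name="autosklearn-classification",
    params={"time_left_for_this_task": TIME_BUDGET, "resampling_strategy": "cv"},
)
neptune.log_metric("roc_auc", score)
neptune.stop()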

Once you’ve fitted the auto-sklearn model, you can inspect the best-performing pipelines with PipelineProfiler (pip install pipelineprofiler). To do that, run the following code:

import PipelineProfiler
# automl is the AutoSklearnClassifier object that was already fitted above
profiler_data= PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

Your output should be like this:

neptune_autosklearn output

I also ran some experiments with auto-sklearn V2. The results were fascinating; you can see the outcome below:

neptune_autosklearn V2

To use auto-sklearn V2, you can use the following code:

import autosklearn.experimental.askl2

TIME_BUDGET = 60
automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task=TIME_BUDGET,
    n_jobs=-1,
    metric=autosklearn.metrics.roc_auc,
)
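
Fitting and scoring then work exactly the same as with the V1 classifier shown earlier:

automl.fit(X_train, y_train)
y_pred = automl.predict_proba(X_val)
print(roc_auc_score(y_val, y_pred[:, 1]))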

Auto-sklearn for regression 

The second type of problem which auto-sklearn can solve is regression. I ran some experiments based on the official example in the auto-sklearn documentation.

import autosklearn.regression
import sklearn.datasets
import sklearn.model_selection
from sklearn.metrics import r2_score

# load the Boston housing dataset, as in the official example
X, y = sklearn.datasets.load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

TIME_BUDGET = 60
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=TIME_BUDGET,
    n_jobs=-1,
)
automl.fit(X_train, y_train, dataset_name='boston')

y_pred = automl.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)

show_models_str = automl.show_models()
sprint_statistics_str = automl.sprint_statistics()

print(show_models_str)
print(sprint_statistics_str)

I only changed the time budget, to track how performance depends on the time limit. The image below shows the results.

neptune_autosklearn result

Final thoughts

Overall, auto-sklearn is still a new technology. Because auto-sklearn is built on top of scikit-learn, many ML practitioners can quickly try it and see how it works. 

The most important advantage of this framework is that it saves experts a lot of time. Its main weakness is that it acts as a black box and doesn’t explain how it makes its decisions.

All in all, it’s a pretty interesting tool, so it’s worth giving auto-sklearn a look.

