Building and Managing Data Science Pipelines with Kedro

Data science promises to generate immense business value across all industries through the compelling capabilities of machine learning. However, a recent report by Gartner revealed that most data science projects fail to progress beyond experimentation despite having sufficient data and intent.

To unlock the full potential of data science, machine learning models need to be deployed in the real world as scalable end-to-end systems that are managed automatically. The desire to maximize data science impact has led to the rise of Machine Learning Operations (MLOps), a set of best practices for building, maintaining, and monitoring machine learning production systems.

As part of MLOps, data science pipelines form the foundation from which machine learning models are effectively deployed and used. This article explores the concepts behind data science pipelines (focusing on machine learning) and how to leverage Kedro, an open-source framework, for creating one.

What is a data science pipeline?

As the name suggests, a data science pipeline involves the seamless linkage of various components to facilitate the smooth movement of data as intended.

If we were to do an online search for data science pipelines, we would see a dizzying array of pipeline designs out there. The good news is that we can boil down these pipelines into these six core elements:

  1. Data retrieval and ingestion
  2. Data preparation
  3. Model training
  4. Model evaluation and tuning
  5. Model deployment
  6. Monitoring
Data science pipeline | Source: Author

The above diagram illustrates how these six components are connected to form a pipeline where machine learning models are primed to deliver the best results in production settings.

Let’s take a closer look at each of these components:

(1) Data retrieval and ingestion

  • Data is the lifeblood of all data science projects, so the first step is to identify the relevant raw data from various data sources. 
  • This step is more challenging than it sounds, as data is often stored in different formats across different silos (e.g., third-party sources, internal databases).
  • Once the required datasets are correctly identified, they are extracted and consolidated into a central location for downstream processing.

(2) Data preparation

  • The quality of insights from data depends on the quality of the data itself. It is therefore no surprise that data preparation takes up the most time and effort.
  • The techniques used for data preparation depend on the task at hand (e.g., classification, regression) and include steps such as data cleaning, data transformation, feature selection, and feature engineering.

(3) Model training

With the data prepared, we are now ready to run machine learning on the training dataset.

  • Model training is where the model prowls through the data and learns the underlying pattern. The trained model will be represented as a statistical function that captures the pattern information from the data.
  • The selection of machine learning models to implement is dependent on the actual task, nature of the data, and business requirements.

(4) Model evaluation and tuning

  • Once model training is complete, it is vital to evaluate its performance. The evaluation is done by tasking the model to run predictions on a dataset that it has not seen before. It serves as a proxy for how well it might perform in the real world.
  • The evaluation metrics help guide the changes needed to optimize model performance (e.g., select different models, tune hyperparameter configurations, etc.).
  • The machine learning development cycle is highly iterative because there are many ways to adjust the model based on the metrics and error analysis.
Example of a model management system using Neptune.ai | Source

(5) Deployment

  • Once we are confident that our model can deliver excellent predictions, we expose the model to real action by deploying it into production settings.
  • Model deployment (aka serving) is the critical step of integrating the model into a production environment where it takes in actual data and generates output for data-driven business decisions.

(6) Monitoring

While it may appear that we have reached the finishing line with successful model deployment, we are still some way from being done.

  • To maintain a robust and continuously operating data science pipeline, it is of utmost importance that we monitor how well it is performing after deployment.
  • Beyond model performance and data quality, the monitoring metrics can also include operational aspects such as resource utilization and model latency.
  • In a mature MLOps setup, we can trigger new iterations of model training based on predictive performance or the availability of new data.
Example drift monitor in Arize | Source

An important point to highlight is that data science pipelines are dynamic and need to be iteratively improved to ensure continuous robustness and accuracy.

In the earlier diagram of the six core pipeline components, we saw several dotted arrows connecting back to prior elements of the pipeline. These arrows reflect the various iterations needed as we update the components in response to changes in monitoring metrics.

For example, suppose the age distribution of the input customer data has been increasingly skewed in recent weeks. In that case, the team should consider reviewing the data preparation steps or retraining the model on the new data.
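
One common way to quantify such a shift is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against recent production data. The sketch below is a minimal, standard-library illustration. The age bins, the sample values, and the 0.2 alert threshold (a widely quoted rule of thumb) are assumptions for illustration and not part of any monitoring tool's API.

```python
import math
from bisect import bisect_right

def bin_fractions(values, edges):
    """Return the fraction of values falling into each bin defined by edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    exp_frac = bin_fractions(expected, edges)
    act_frac = bin_fractions(actual, edges)
    score = 0.0
    for e, a in zip(exp_frac, act_frac):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        score += (a - e) * math.log(a / e)
    return score

# Hypothetical customer ages: training data vs. recent weeks skewed older
train_ages = [22, 25, 31, 34, 38, 41, 45, 47, 52, 58]
recent_ages = [41, 45, 49, 52, 55, 58, 61, 63, 66, 70]
age_edges = [30, 40, 50, 60]  # bins: <30, 30-39, 40-49, 50-59, 60+

drift_score = psi(train_ages, recent_ages, age_edges)
needs_review = drift_score > 0.2  # rule-of-thumb threshold (assumption)
print(f"PSI = {drift_score:.3f}, review pipeline: {needs_review}")
```

A score of zero means the two binned distributions are identical, and larger values indicate a stronger shift. In a monitoring setup, a check like this would run on a schedule and feed the decision to revisit data preparation or retrain.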

Importance of data science pipelines

It is vital to appreciate the importance and benefits of data science pipelines in the first place because they require effort to build.

The goal of these pipelines is to create a systematic workflow where raw data is transformed into actionable business insights in an automated and repeatable fashion.

Their importance comes from automating the manual steps in the data science development cycle, which are repetitive, labour-intensive, error-prone, and time-consuming.

This streamlining of data movement allows data scientists to focus on what they do best, i.e., improving data and models, without worrying about the engineering aspects of data flow, algorithm updates, and model deployment.

Here are the business benefits that data science pipelines can bring:

  • Speeds up data-driven decision-making to respond to evolving business needs and customer preferences.
  • Unlocks new opportunities and comprehensive insights via data consolidation from silos into a single destination.
  • Ensures consistency, reproducibility, and quality of data insights.
  • Simplifies the introduction of new business ideas and requirements into a machine learning system.

Data science pipeline use cases

Data science pipelines are industry-agnostic, so we can expect them to deliver substantial benefits across many different fields.

Here are some real-world examples of data science pipelines implemented in the industry:

  • Software Industry: Dropbox built a modern optical character recognition (OCR) pipeline to create a mobile document scanning feature.
Creating a modern OCR pipeline using computer vision and deep learning | Source
  • Transportation Industry: Lyft utilized an in-house pipeline framework to accelerate machine learning and data orchestration for the core ride-hailing functions of pricing, locations, and arrival time estimations.
  • Insurance Industry: USAA leveraged a machine learning pipeline to improve motor claims processing by creating a service that maps a photo of a damaged vehicle to the parts that should be repaired or replaced.
  • Healthcare Industry: The UK’s National Health Service (NHS) developed and deployed a suite of machine learning pipelines as part of a national hospital capacity planning system for COVID-19.

What is Kedro?

Having seen the immense value that data science pipelines can bring, let’s explore how to implement them and turn these theoretical benefits into systematic reality. 

The recognition of pipelines’ importance has spurred the development of several frameworks for building and managing pipelines effectively. One such framework is Kedro, which is the focus of this article.

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. This framework helps to accelerate data pipelining, enhance data science prototyping, and promote pipeline reproducibility.

Kedro applies software engineering concepts to developing production-ready machine learning code to reduce the time and effort needed for successful model deployment.

It achieves this impact by eliminating the re-engineering work caused by low-quality code and by standardizing project templates for seamless collaboration.

Let’s take a look at the applied concepts within Kedro:

  • Reproducibility: Ability to recreate the steps of a workflow across different pipeline runs and environments accurately and consistently.
  • Modularity: Breaking down large code chunks into smaller, self-contained, and understandable units that are easy to test and modify.
  • Maintainability: Use of standard code templates that allow teammates to readily comprehend and maintain the setup of any project, thereby promoting a standardized approach to collaborative development.
  • Versioning: Precise tracking of the data, configuration, and machine learning model used in each pipeline run of every project.
  • Documentation: Clear and structured code information for easy reading and understanding.
  • Seamless Packaging: Allowing data science projects to be documented and shipped efficiently into production (with tools such as Airflow or Docker).

Why Kedro?

The path to bringing a data science project from pilot development to production is fraught with challenges.

Some of the significant difficulties include:

  • Code that needs to be rewritten for production environments, leading to significant project delays.
  • Project structures that are disorganized and incoherent, making collaboration challenging.
  • Data flow that is hard to trace.
  • Functions that are overly lengthy and difficult to test or reuse.
  • Relationships between functions that are hard to understand.

The QuantumBlack team developed Kedro to tackle the above challenges. It is born out of the collective belief that data science code should be production-ready from the get-go as it serves as a disciplined starting point for successful pipeline deployment.

Kedro in the real world

As with all projects, the proof is in the pudding. Here are some examples of Kedro being successfully used in real-world applications:

  • NASA utilized Kedro as part of their cloud-based predictive engine that predicts impeded and unimpeded taxi out duration within the airspace.
  • JungleScout sped up the training and review of its sales estimation models 18 times with the help of Kedro.
  • Telkomsel uses Kedro to run hundreds of feature engineering tasks and serve dozens of machine learning models in their production environment.
  • ElementAI improved their work efficiency by leveraging Kedro in their scheduling software to measure historical performance and create replay scenarios.

Building a data science pipeline for anomaly detection

Now that we understand Kedro, let us get to the exciting part where we work through a practical hands-on example.

The project use case revolves around financial fraud detection. We will build an anomaly detection pipeline to identify anomalies in credit card transactions using isolation forest as our machine learning model.

The credit card transaction data is obtained from the collaboration between Worldline and Machine Learning Group. It is a realistic simulation of real-world credit card transactions and has been designed to include complicated fraud detection issues.

The following visualization shows what our final anomaly detection pipeline will look like and serves as a blueprint for what we will build in the following sections.

Anomaly detection pipeline
Anomaly detection pipeline | Source: Author

Feel free to check out this project’s GitHub repo to follow along with this walkthrough.

Pipelines with Kedro

Step 1: Installing Kedro and Kedro-Viz

It is recommended to create a virtual environment so that each project has its own isolated environment with the relevant dependencies.

To work with Kedro, the official Kedro documentation recommends that users download and install Anaconda.

Because my system Python version is newer than 3.10, Anaconda makes it easy to create an environment (using conda instead of venv) with a version that is compatible with Kedro’s requirements (i.e., Python 3.6 to 3.8 at the time of writing).

In particular, this is the command (in Anaconda Powershell Prompt) to generate our Kedro environment:

conda create --name kedro-env python=3.7 -y

Once the virtual environment is set up and activated with conda activate kedro-env, we can use pip to install Kedro and the Kedro-Viz plugin:

pip install kedro kedro-viz

We can check whether Kedro is properly installed by changing the directory to our project folder and then entering kedro info. If installed correctly, we should see the following output:

[Image: terminal output of the kedro info command]

At this point, we can install the other packages needed for our project:

pip install scikit-learn matplotlib

If we wish to initialize this project as a Git repository, we can do so with:

git init

Step 2: Project setup

One of the key features of Kedro is the creation of standard, modifiable and easy-to-use project templates. We can easily initialize a new Kedro project with:

kedro new

After providing the relevant names to the series of prompts, we will end up with a highly organized project directory that we can build upon:

[Image: project directory generated by kedro new]

The project structure can be grouped into six main folders:

  • /conf: Contains configuration files that specify vital details such as data sources (i.e. Data Catalog), model parameters, credentials, and logging information.
  • /data: Contains the input, intermediate and output data. It is organized into an eight-layer data engineering convention to clearly categorize and structure how data is processed.
  • /docs: Contains the files relating to the project documentation.
  • /logs: Contains the log files generated when pipeline runs are performed.
  • /notebooks: Contains Jupyter notebooks used in the project e.g., for experimentation or initial exploratory data analysis.
  • /src: Contains the source code for the project, such as Python scripts for the pipeline steps, data processing, and model training.

Step 3: Data setup

Data comes before science, so let us start with the data setup. In this project, the raw data (70 CSV files of daily credit card transactions) is first placed inside the data/01_raw folder.

We already know what data will be generated and utilized along the pipeline based on our project blueprint described earlier. Therefore, we can translate this information into the Data Catalog, a registry of data sources available for the project. 

The Data Catalog provides a consistent way of defining how data is stored and parsed, making it easy for datasets to be loaded and saved from anywhere within the pipeline.

We can find the Data Catalog in the .yml file —  conf/base/catalog.yml.

[Image: snippet of conf/base/catalog.yml]

The above image is a snippet of the data sources defined in the Data Catalog. For example, we first expect our raw CSV files to be read and merged into an intermediate CSV dataset called merged_data.csv.

Kedro comes with numerous built-in data connectors (e.g., pandas.CSVDataSet, matplotlib.MatplotlibWriter) to accommodate the different data types out there.
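
To make the structure of catalog entries concrete, the sketch below mirrors a few entries as a Python dictionary, using the same keys that would appear in a catalog.yml file. The dataset names follow the pipeline code in this article, and the PartitionedDataSet type for the raw files is inferred from the signature of the merge function used later. The file names other than merged_data.csv are hypothetical, so treat this as an illustration and not a copy of the project's actual catalog.

```python
# Dictionary equivalent of a few catalog.yml entries (illustrative only)
catalog_config = {
    "raw_daily_data": {
        "type": "PartitionedDataSet",      # one partition per daily CSV file
        "path": "data/01_raw",
        "dataset": "pandas.CSVDataSet",
    },
    "merged_data": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/02_intermediate/merged_data.csv",
    },
    "ml_model": {
        "type": "pickle.PickleDataSet",
        "filepath": "data/06_models/ml_model.pkl",  # hypothetical file name
    },
    "evaluation_plot": {
        "type": "matplotlib.MatplotlibWriter",
        "filepath": "data/08_reporting/evaluation_plot.png",  # hypothetical
    },
}

def describe(config):
    """Return 'name -> type @ location' lines for a quick sanity check."""
    lines = []
    for name, entry in config.items():
        location = entry.get("filepath") or entry.get("path")
        lines.append(f"{name} -> {entry['type']} @ {location}")
    return lines

for line in describe(catalog_config):
    print(line)
```

Each top-level key is the dataset name that nodes refer to in their inputs and outputs, which is what lets the pipeline code stay free of file paths.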

Step 4: Create pipelines

Once our Data Catalog is appropriately defined, we can build our pipelines. Firstly, there are two key concepts to understand: Nodes and Pipelines.

  • Nodes are the building blocks of pipelines. They are essentially Python functions that represent the data transformations to be performed, e.g., data pre-processing and modelling.
  • Pipelines are sequences of nodes that are connected to deliver a workflow. A pipeline organizes the nodes’ dependencies and execution order and connects inputs and outputs while keeping the code modular.
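
The idea that a pipeline derives its execution order from node inputs and outputs can be illustrated without Kedro. The following is a simplified sketch and not Kedro's implementation: each node is a plain tuple of (function, input names, output name), and a hypothetical run_pipeline helper runs whichever nodes have all of their inputs available.

```python
def run_pipeline(nodes, data):
    """Run nodes in dependency order; nodes are (func, inputs, output) tuples."""
    data = dict(data)      # do not mutate the caller's dictionary
    pending = list(nodes)
    order = []
    while pending:
        ready = [n for n in pending if all(name in data for name in n[1])]
        if not ready:
            raise ValueError("Unresolvable inputs or circular dependency")
        for func, inputs, output in ready:
            data[output] = func(*(data[name] for name in inputs))
            order.append(func.__name__)
            pending.remove((func, inputs, output))
    return data, order

# Toy node functions standing in for real data transformations
def clean(raw):
    return [x for x in raw if x is not None]

def scale(cleaned, factor):
    return [x * factor for x in cleaned]

def total(scaled):
    return sum(scaled)

# Nodes are deliberately listed out of order
nodes = [
    (total, ["scaled"], "result"),
    (scale, ["cleaned", "factor"], "scaled"),
    (clean, ["raw"], "cleaned"),
]

results, order = run_pipeline(nodes, {"raw": [1, None, 2, 3], "factor": 10})
print(order)
print(results["result"])
```

Because the order is resolved from the declared names, the nodes can be listed in any sequence, which is the property that makes it practical to split a workflow into smaller modular pipelines and join them later.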

The complete pipeline for anomaly detection can be divided into three smaller modular pipelines, which we will eventually connect:

  1. Data engineering pipeline
  2. Data science pipeline
  3. Model evaluation pipeline

We can instantiate these modular pipelines with the following commands based on the names we assign:

kedro pipeline create data_engineering
kedro pipeline create data_science
kedro pipeline create model_evaluation

While the pipelines are empty at this stage, their structures have been nicely generated inside the /src folder:

[Image: modular pipeline folders generated inside /src]

Each pipeline folder contains the same set of files, including nodes.py (the code for the nodes) and pipeline.py (the code for the pipeline).

Step 5:  Build a data engineering pipeline

Let’s first look at the data engineering pipeline, where we process the data to make it suitable for downstream machine learning. More specifically, there are three preprocessing tasks to be performed:

  1. Merge raw datasets into an intermediate merged dataset.
  2. Process the merged dataset by keeping only the predictor columns and creating a new date column for subsequent train-test split.
  3. Perform chronological 80:20 train-test split and drop unnecessary columns.
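
A chronological split differs from a random one in that every training record precedes every test record, which avoids leaking future information into training. The sketch below shows the idea with the standard library only. The 70-day toy dataset and the helper name chronological_split are assumptions for illustration, and the cutoff is set so that the first 56 of 70 days (80%) form the training set.

```python
from datetime import date, timedelta

def chronological_split(records, train_days):
    """Split (day, value) records so the first train_days days form the train set."""
    first_day = min(day for day, _ in records)
    cutoff = first_day + timedelta(days=train_days)  # first day of the test period
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

# Toy data: one record per day for 70 days (hypothetical values)
start = date(2018, 4, 1)
records = [(start + timedelta(days=i), i) for i in range(70)]

train, test = chronological_split(records, train_days=8 * 7)
print(len(train), len(test))
print(max(d for d, _ in train) < min(d for d, _ in test))
```

Whether the cutoff day itself belongs to the training or the test set is a design choice: with an inclusive comparison (<=) against the same cutoff, the training period would cover 57 days instead of 56.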

We start by scripting the tasks as three separate node functions inside nodes.py, as shown below:

# src/anomaly_detection_pipeline_kedro/pipelines/data_engineering/nodes.py

from typing import Any, Callable, Dict, Tuple
from datetime import timedelta
import pandas as pd

def merge_data(partitioned_input: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    merged_df = pd.DataFrame()
    # Sort by partition ID so the daily files are merged in a stable order
    for partition_id, partition_load_func in sorted(partitioned_input.items()):
        partition_data = partition_load_func()  # Load partition data
        merged_df = pd.concat([merged_df, partition_data], ignore_index=True, sort=True)
    return merged_df

def process_data(merged_df: pd.DataFrame, predictor_cols: list) -> pd.DataFrame:
    merged_df = merged_df.copy()  # Avoid modifying the input dataset in place
    merged_df['TX_DATETIME'] = pd.to_datetime(merged_df['TX_DATETIME'], infer_datetime_format=True)
    merged_df['TX_DATE'] = merged_df['TX_DATETIME'].dt.date
    # Keep only the predictor columns listed in the parameters file
    processed_df = merged_df[predictor_cols].copy()
    return processed_df

def train_test_split(processed_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    processed_df = processed_df.copy()  # Avoid modifying the input dataset in place
    processed_df['TX_DATE'] = pd.to_datetime(processed_df['TX_DATE'], infer_datetime_format=True)
    # Chronological split: rows dated up to and including split_date go to training
    split_date = processed_df['TX_DATE'].min() + timedelta(days=(8*7))
    train_df = processed_df.loc[processed_df['TX_DATE'] <= split_date].drop(columns=['TX_DATE'])
    test_df = processed_df.loc[processed_df['TX_DATE'] > split_date].drop(columns=['TX_DATE'])

    # Drop actual label if any (supposed to be unsupervised training)
    if 'TX_FRAUD' in train_df.columns:
        train_df = train_df.drop(columns=['TX_FRAUD'])
    if 'TX_FRAUD' in test_df.columns:
        test_labels = test_df[['TX_FRAUD']]  # Store test labels (if any)
        test_df = test_df.drop(columns=['TX_FRAUD'])
    else:
        test_labels = pd.DataFrame()  # Empty dataframe if no test label
    return train_df, test_df, test_labels

We then import these node functions into pipeline.py to link them in the correct sequence.

# src/anomaly_detection_pipeline_kedro/pipelines/data_engineering/pipeline.py

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import merge_data, process_data, train_test_split

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=merge_data,
            inputs="raw_daily_data",
            outputs="merged_data",
            name="node_merge_raw_daily_data"
            ),
        node(
            func=process_data,
            inputs=["merged_data", "params:predictor_cols"],
            outputs="processed_data",
            name="node_process_data"
            ),
        node(
            func=train_test_split,
            inputs="processed_data",
            outputs=["train_data", "test_data", "test_labels"],
            name="node_train_test_split"
            ),
    ])

Notice that in each node wrapper node(...) above, we specify a node name, a function (imported from nodes.py), and the input and output datasets defined in the Data Catalog.

The arguments in the node wrappers should match the dataset names in the Data Catalog and the arguments of the node functions.

For the node node_process_data, the list of predictor columns is stored in the parameters file found in conf/base/parameters.yml.

Our data engineering pipeline setup is complete, but it is not ready since it is not yet registered. We will explore this later in Step 8, so let’s continue building the two remaining pipelines.

Step 6:  Build a data science pipeline

The anomaly detection model for our data science pipeline is an isolation forest. Isolation forest is an unsupervised algorithm that is built using decision trees.

It ‘isolates’ observations by randomly selecting a feature, and then choosing a split value between its maximum and minimum values. Since anomalies are few and different, they are expected to be easier to isolate than normal observations.
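
To see why anomalies are easier to isolate, consider a one-dimensional toy version of the splitting procedure. The sketch below is not scikit-learn's implementation. It repeatedly picks a random split between the current minimum and maximum, keeps the side containing the target point, and counts how many splits are needed until the target stands alone. The data values and the helper names isolation_depth and mean_depth are assumptions for illustration.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits needed until target is the only value left."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        split = rng.uniform(min(current), max(current))
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, trials=500, seed=42):
    """Average isolation depth of target over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Typical transaction amounts clustered together, plus one extreme value
amounts = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 500.0]

normal_depth = mean_depth(amounts, target=14.0)
anomaly_depth = mean_depth(amounts, target=500.0)
print(f"normal: {normal_depth:.2f} splits, anomaly: {anomaly_depth:.2f} splits")
```

The extreme value is usually separated by the very first split, while a value in the middle of the cluster needs several. An isolation forest turns this average path length, taken over many random trees and features, into an anomaly score.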

We will use the scikit-learn isolation forest implementation for the modeling. There are two tasks (and nodes) to be created — (i) model training and (ii) model predictions (aka inference).

The contamination value for the isolation forest model is set at 0.009, reflecting the proportion of fraud cases observed in the original dataset (i.e., 0.9%).

# src/anomaly_detection_pipeline_kedro/pipelines/data_science/nodes.py

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def train_model(train_df: pd.DataFrame, contamination_value: float):
    clf = IsolationForest(random_state=42, bootstrap=True, contamination=contamination_value)
    clf.fit(train_df.values) 
    return clf

def predict(ml_model, test_df: pd.DataFrame) -> pd.DataFrame:
    features = test_df.values  # Same array format that was used for fitting
    preds = ml_model.predict(features)  # -1 = anomaly, 1 = normal
    anomaly_scores = ml_model.score_samples(features)  # Lower = more abnormal

    result_df = test_df.copy()  # Avoid modifying the input dataset in place
    result_df['ANOMALY_SCORE'] = -anomaly_scores  # Flip sign so higher = more anomalous
    result_df['ANOMALY'] = (preds == -1).astype(int)  # 1 = anomaly, 0 = normal
    return result_df


Like before, we link the nodes together within a pipeline function in pipeline.py.

# src/anomaly_detection_pipeline_kedro/pipelines/data_science/pipeline.py

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import train_model, predict

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=train_model,
            inputs=["train_data", "params:contamination_value"],
            outputs="ml_model",
            name="node_train_model"
            ),
        node(
            func=predict,
            inputs=["ml_model", "test_data"],
            outputs="predictions",
            name="node_predict"
            ),
    ])

As seen earlier in the Data Catalog, we will be saving our trained isolation forest model as a pickle file in data/06_models.

Step 7:  Build a model evaluation pipeline

Although isolation forest is an unsupervised model, we can still evaluate its performance if we have ground truth labels.

In the original dataset, there is the TX_FRAUD variable that serves as an indicator of fraudulent transactions.

With the ground truth labels and anomaly scores from the model predictions, we can obtain the evaluation metrics readily, which will be presented as AUC and AUCPR plots.
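
As a reminder of what the AUC summarizes, it equals the probability that a randomly chosen fraudulent transaction receives a higher anomaly score than a randomly chosen legitimate one. The sketch below computes it directly from that definition using the standard library. The labels and scores are made-up values, and in practice a library routine such as scikit-learn's roc_auc_score would typically be used instead.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # fraud scored higher: correct ordering
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up ground truth (1 = fraud) and anomaly scores (higher = more anomalous)
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(pairwise_auc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

Under this reading, the 0.5 baseline corresponds to scores that order fraud and non-fraud pairs no better than chance.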

# src/anomaly_detection_pipeline_kedro/pipelines/model_evaluation/nodes.py

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc

def evaluate_model(predictions: pd.DataFrame, test_labels: pd.DataFrame):
    def get_auc(labels, scores):
        fpr, tpr, thr = roc_curve(labels, scores)
        auc_score = auc(fpr, tpr)
        return fpr, tpr, auc_score

    def get_aucpr(labels, scores):
        precision, recall, th = precision_recall_curve(labels, scores)
        # Area under the PR curve: recall on the x-axis, precision on the y-axis
        aucpr_score = auc(recall, precision)
        return precision, recall, aucpr_score

    def plot_metric(ax, x, y, x_label, y_label, plot_label, style="-"):
        ax.plot(x, y, style, label=plot_label)
        ax.legend()
        ax.set_xlabel(x_label)
        ax.set_ylabel(y_label)

    def prediction_summary(labels, predicted_score, info, plot_baseline=True, axes=None):
        if axes is None: axes = [plt.subplot(1, 2, 1), plt.subplot(1, 2, 2)]
        fpr, tpr, auc_score = get_auc(labels, predicted_score)
        plot_metric(axes[0], fpr, tpr, "False positive rate", "True positive rate", "{} AUC={:.4f}".format(info, auc_score))
        if plot_baseline:
            plot_metric(axes[0], [0, 1], [0, 1], "False positive rate", "True positive rate", "Baseline AUC=0.5", "r--")
        precision, recall, aucpr_score = get_aucpr(labels, predicted_score)
        plot_metric(axes[1], recall, precision, "Recall", "Precision", "{} AUCPR={:.4f}".format(info, aucpr_score))
        if plot_baseline:
            thr = np.mean(labels)  # Fraction of fraud cases = baseline precision
            plot_metric(axes[1], [0, 1], [thr, thr], "Recall", "Precision", "Baseline AUCPR={:.4f}".format(thr), "r--")
        # No plt.show() here: the returned figure is saved via the Data Catalog
        return axes

    fig, axes = plt.subplots(1, 2, figsize=(4.5*2, 4.5))
    prediction_summary(test_labels['TX_FRAUD'].values, predictions['ANOMALY_SCORE'].values, "Isolation Forest", axes=axes)
    return fig

Here is the pipeline.py script to run the model evaluation node.

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import evaluate_model

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=evaluate_model,
            inputs=["predictions", "test_labels"],
            outputs="evaluation_plot",
            name="node_model_evaluation"
            )
    ])

This model evaluation step is separated from the data science pipeline seen in Step 6. This separation is because we are using an unsupervised anomaly detection algorithm, and we do not expect to always have ground truth data.

Step 8:  Registering all pipelines in the pipeline registry

At this point, all the hard work in pipeline creation has been accomplished. We now need to conclude by importing and registering all three modular pipelines in the pipeline registry.

from typing import Dict
from kedro.pipeline import Pipeline, pipeline

from anomaly_detection_pipeline_kedro.pipelines import (
    data_engineering as de,
    data_science as ds,
    model_evaluation as me
)

def register_pipelines() -> Dict[str, Pipeline]:
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    model_evaluation_pipeline = me.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "me": model_evaluation_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline + model_evaluation_pipeline
    }

The __default__ line in the return statement indicates the default sequence of modular pipelines to run, which in our case is all three modular pipelines — data_engineering, data_science, and model_evaluation.

The beauty of Kedro is that its modular structure gives us flexibility in structuring our pipeline. For example, if we do not have ground truth labels, we can exclude model_evaluation from the default pipeline run. 
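
One way to express this choice is to assemble the default pipeline from a list of stages and skip the evaluation stage when labels are unavailable. The sketch below uses plain lists of node names as stand-ins for Kedro Pipeline objects (which, as in the registry above, can be combined with +). The helper build_default and the has_labels flag are hypothetical names and not part of Kedro.

```python
from functools import reduce
from operator import add

def build_default(stages, has_labels):
    """Combine stage pipelines with +, leaving out evaluation if no labels exist."""
    selected = [
        stage for name, stage in stages
        if has_labels or name != "me"
    ]
    return reduce(add, selected)

# Lists of node names stand in for the three modular pipelines
stages = [
    ("de", ["node_merge_raw_daily_data", "node_process_data", "node_train_test_split"]),
    ("ds", ["node_train_model", "node_predict"]),
    ("me", ["node_model_evaluation"]),
]

with_eval = build_default(stages, has_labels=True)
without_eval = build_default(stages, has_labels=False)
print(len(with_eval), len(without_eval))
```

The same pattern would apply inside register_pipelines, where the flag could come from a project parameter or from a check on whether the label column exists.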

Step 9: Visualize the pipeline

Before running the pipeline, it would be a good idea to examine what we have built so far. The fantastic Kedro-Viz plugin allows us to visualize the entire pipeline structure and dependencies readily.

Given its ease of use, clarity, and aesthetic display, it is no surprise that many QuantumBlack clients expressed their delight at this feature.

We can easily generate the visualization with this command:

kedro viz

A new tab will open in our browser, and we will be greeted with a beautiful visualization tool to explore our pipeline structure. This visualization can also be easily exported as a .png image file.

[Image: Kedro-Viz visualization of the anomaly detection pipeline]

Step 10:  Run the pipeline

We are finally ready to run our pipeline. The following command will execute the default pipeline that we have registered earlier.

kedro run

Upon running, the pipeline will populate the respective directories with the generated data, including the anomaly predictions and model evaluation plots.

We can also run specific pipelines that we have registered in the pipeline registry. For example, if we wish to run only the data engineering modular pipeline (de), we can add --pipeline <NAME> to the command:

kedro run --pipeline de

Step 11:  Evaluating pipeline output

Finally, it is time to assess the output of our anomaly detection pipeline. In particular, let us review the matplotlib evaluation plots (saved in data/08_reporting) to see how performant the model is.

Evaluating pipeline output | Source: Author

From the plots, we can see that the isolation forest model AUC is 0.8486, which is a pretty good baseline machine learning model performance.

Additional features

Congratulations on making it this far and successfully creating an anomaly detection pipeline with Kedro! 

Beyond the fundamental functionalities detailed earlier, Kedro comes with other useful features for managing data science projects. Here are several capabilities worth mentioning:

(1) Experiment tracking

Kedro makes it easy to set up experiment tracking, and access logged metrics from each pipeline run. Besides its internal experiment tracking capabilities, Kedro integrates well with other MLOps services.

For example, the Kedro-Neptune plugin lets users enjoy the benefits of a nicely organized pipeline together with a powerful Neptune user interface built for metadata management.

Kedro pipeline metadata in a custom dashboard in the Neptune UI | Source

(2) Pipeline slicing

The pipeline slicing capabilities of Kedro allow us to execute specific portions of the pipeline as we desire. For example, we can define the start and finish nodes in the pipeline slice that we wish to run:

kedro run --from-nodes node_train_test_split --to-nodes node_train_model

(3) Project documentation, packaging, and deployment

We can generate project-specific documentation (built on the Sphinx framework) by running this command in the project’s root directory.

kedro build-docs

Next, to initiate the packaging of the project as a Python library, we run the following command:

kedro package

Lastly, we can deploy these packaged data science pipelines via first-party plugins such as Kedro-Docker and Kedro-Airflow.

(4) Access Kedro through Jupyter Notebooks

We can also use Jupyter notebooks to explore the project data or experiment with the code to create new nodes for a pipeline. These tasks can be done by starting a Kedro Jupyter session with:

kedro jupyter notebook

Conclusion

Here is what we have gone through in this article:

  • The concepts, benefits, and use cases of data science pipelines.
  • What Kedro is and how it can be used to build and manage data science pipelines.
  • How to create an anomaly detection pipeline for credit card transaction data using the isolation forest model and the Kedro framework.

There are numerous other interesting features and tutorials available, so check out the official Kedro documentation for further exploration. Also, feel free to explore the GitHub repo containing all the code for this project.
