One of the most prevalent complaints we hear from ML engineers in the community is how costly and error-prone it is to manually go through the ML workflow of building and deploying models. They run scripts manually to preprocess their training data, rerun the deployment scripts, manually tune their models, and spend their working hours keeping previously developed models up to date.
Building end-to-end machine learning pipelines lets ML engineers build once, then rerun and reuse many times. It lets them focus more on deploying new models than on maintaining existing ones.
If all goes well, of course 😉
In this article, you will:
- 1 Explore what the architecture of an ML pipeline looks like, including the components.
- 2 Learn the essential steps and best practices machine learning engineers can follow to build robust, scalable, end-to-end machine learning pipelines.
- 3 Quickly build and deploy an end-to-end ML pipeline with Kubeflow Pipelines on AWS.
- 4 Learn the challenges of building end-to-end ML pipelines and the best practices to build them.
What is a machine learning pipeline?
Machine learning pipelines are composed of a sequence of linked components or steps that define the machine learning workflow to solve specific problems. The pipelines let you orchestrate the steps of your ML workflow that can be automated. The orchestration here implies that the dependencies and data flow between the workflow steps must be completed in the proper order.
You would build a pipeline to:
- Achieve reproducibility in your workflow (running the pipeline repeatedly on similar inputs will provide similar outputs).
- Simplify the end-to-end orchestration of the multiple steps in the machine learning workflow for projects with little to no intervention (automation) from the ML team.
- Reduce the time it takes for data and models to move from the experimentation phase to the production phase.
- Allow your team to focus more on developing new solutions than maintaining existing ones using modular components that offer automation for your workflow.
- Make it easy to reuse components (a specific step in the machine learning workflow) to create and deploy end-to-end solutions that integrate with external systems without rebuilding each time.
Machine learning pipeline vs machine learning platform
The ML pipeline is part of the broader ML platform. It is used to streamline, orchestrate, and automate the machine learning workflow within the ML platform.
Pipelines and platforms are related concepts in MLOps, but they refer to different aspects of the machine learning workflow. An ML platform is an environment that standardizes the technology stack for your ML/AI team and provides tools, libraries, and infrastructure for developing, deploying, and operationalizing machine learning applications.
The platform typically includes components for the ML ecosystem like data management, feature stores, experiment trackers, a model registry, a testing environment, model serving, and model management. It is designed to provide a unified and integrated environment for primarily data scientists and MLEs to develop and deploy models, manage data, and streamline the machine learning workflow.
The architecture of a machine learning pipeline
The machine learning pipeline architecture can be a real-time (online) or batch (offline) construct, depending on the use case and production requirements. To keep concepts simple in this article, you will learn what a typical pipeline looks like without the nuances of real-time or batch constructs.
Semi Koen’s article gives detailed insight into machine learning pipeline architectures.
Common types of machine learning pipelines
In line with the stages of the ML workflow (data, model, and production), an ML pipeline comprises three different pipelines that solve different workflow stages. They include:
- 1 Data (or input) pipeline.
- 2 Model (or training) pipeline.
- 3 Serving (or production) pipeline.
In large organizations, two or more teams would likely handle each pipeline due to its functionality and scale. The pipelines are interoperable to build a working system:
Data (input) pipeline (data acquisition and feature management steps)
This pipeline transports raw data from one location to another. It covers the entire data movement process, from where the data is collected, for example, through data streams or batch processing, to downstream applications like data lakes or machine learning models.
Model training pipeline
This pipeline trains one or more models on the training data with preset hyperparameters. It evaluates them, fine-tunes them, and packages the optimal model before sending it downstream to applications like the model registry or serving pipeline.
Serving pipeline
This pipeline deploys the model as a prediction (or scoring) service in production and uses another service to enable performance monitoring.
This article classifies the different pipelines as “machine learning pipelines” because they enable ML applications based on their function in the workflow. Moreover, they are interoperable to enable production applications, especially during maintenance (retraining and continuous testing).
Elements of a machine learning pipeline
Some pipelines will provide high-level abstractions for these components through three elements:
- Transformer: an algorithm able to transform one dataset into another.
- Estimator: an algorithm trained on a dataset to produce a transformer.
- Evaluator: to examine the accuracy of the trained model.
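To make the three elements more concrete, here is a minimal, library-free sketch of how they could relate to each other. The names `ThresholdEstimator`, `ThresholdModel`, and `accuracy_evaluator` are hypothetical and do not come from any particular framework; the "training" rule is deliberately trivial.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ThresholdModel:
    """Transformer: maps a dataset of feature values to a dataset of predictions."""
    threshold: float

    def transform(self, features: List[float]) -> List[int]:
        return [1 if x >= self.threshold else 0 for x in features]


class ThresholdEstimator:
    """Estimator: is fit on labeled data and produces a transformer (the model)."""

    def fit(self, rows: List[Tuple[float, int]]) -> ThresholdModel:
        positives = [x for x, y in rows if y == 1]
        negatives = [x for x, y in rows if y == 0]
        # Place the decision threshold midway between the two class means.
        pos_mean = sum(positives) / len(positives)
        neg_mean = sum(negatives) / len(negatives)
        return ThresholdModel(threshold=(pos_mean + neg_mean) / 2)


def accuracy_evaluator(predictions: List[int], labels: List[int]) -> float:
    """Evaluator: scores the predictions against the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


train = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
model = ThresholdEstimator().fit(train)
preds = model.transform([x for x, _ in train])
score = accuracy_evaluator(preds, [y for _, y in train])
```

The point of the split is that the fitted model is itself just another transformer, so it can be dropped into later pipeline steps like any other data transformation.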
Components of the machine learning pipeline
A pipeline component is one step in the machine learning workflow that performs a specific task by taking input, processing it, and producing an output. The components comprise implementations of the manual workflow process you engage in for automatable steps, including:
- Data ingestion (extraction and versioning).
- Data validation (writing tests to check for data quality).
- Data preprocessing.
- Model training and tuning, given a select number of algorithms to explore and a range of hyperparameters to use during experimentation.
- Model performance analysis and evaluation.
- Model packaging and registration.
- Model deployment.
- Model scoring.
- Model performance monitoring.
With most tools, the pipeline components will contain executable code that can be containerized (to eliminate dependency issues). Each step can be managed with an orchestration tool such as Kubeflow Pipelines, Metaflow, or ZenML.
Let’s briefly go over each of the components below.
Data ingestion, extraction, and versioning
This component ingests data from a data source (external to the machine learning pipeline) as input. It then transforms the dataset into a format (e.g., CSV or Parquet) that will be used in the next steps of the pipeline. At this step, the raw and versioned data are also transformed to make it easier to trace their lineage.
Data validation
This step collects the transformed data as input and, through a series of tests and validators, ensures that it meets the criteria for the next component. It checks the data for quality issues and detects outliers and anomalies. This component also checks for signs of data drift or potential training–serving skew to send logs to other components or alert the data scientist in charge.
If the validation tests pass, the data is sent to the next component, and if it fails, the error is logged, and the execution stops.
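The pass-or-stop behavior described above can be sketched as a small validation gate. This is only an illustration: the check functions, the `age` field, and the `ValidationError` class are assumptions made up for the example, not part of any specific validation library.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("data_validation")

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


class ValidationError(Exception):
    """Raised to stop the pipeline when a data quality check fails."""


def no_missing_age(rows: List[Row]) -> bool:
    return all(row.get("age") is not None for row in rows)


def age_in_range(rows: List[Row]) -> bool:
    return all(0 <= row["age"] <= 120 for row in rows if row.get("age") is not None)


def validate(rows: List[Row], checks: List[Check]) -> List[Row]:
    """Run every check; pass the data on if all succeed, otherwise log and stop."""
    failed = [check.__name__ for check in checks if not check(rows)]
    if failed:
        logger.error("Validation failed: %s", failed)
        raise ValidationError(f"failed checks: {failed}")
    return rows


good = [{"age": 29}, {"age": 41}]
bad = [{"age": 29}, {"age": None}, {"age": 300}]
validated = validate(good, [no_missing_age, age_in_range])
```

Calling `validate(bad, ...)` raises `ValidationError`, which is what would halt the run before the preprocessing component receives low-quality data.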
Data preprocessing and feature engineering
The data cleaning, segregation, and feature engineering steps take the validated and transformed data from the previous component as input. The processes involved in this step depend on the problem you are solving and the data. Processes here may include:
- Feature selection: Selecting the most appropriate features to be cleaned and engineered.
- Feature cleaning: Treating missing feature values and handling outliers by capping/flooring them.
- Feature transformation: Transforming skewed features in the data (if applicable).
- Feature creation: Creating new features from existing ones or combining different features to create a new one.
- Data segregation: Splitting data into training, testing, and validation sets.
- Feature standardization/normalization: Converting the feature values into similar scale and distribution values.
- Publishing features to a feature store to be used for training and inference by the entire organization.
Again, what goes on in this component depends on the data scientist’s initial (manual) data preparation process, the problem, and the data used.
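As a rough sketch of three of the processes listed above (capping/flooring outliers, standardization, and data segregation), here is a version that uses only the Python standard library. The function names, the sample fare values, and the 60/20/20 split are assumptions chosen for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def cap_and_floor(values: Sequence[float], lower: float, upper: float) -> List[float]:
    """Feature cleaning: clip outliers to the [lower, upper] range."""
    return [min(max(v, lower), upper) for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Feature standardization: rescale to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def split(rows: Sequence[float], train_frac: float = 0.6, val_frac: float = 0.2,
          seed: int = 42) -> Tuple[list, list, list]:
    """Data segregation: shuffle with a fixed seed, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val], shuffled[n_train + n_val:]


fares = [7.25, 71.28, 8.05, 512.33, 13.0, 26.55, 7.9, 30.0, 15.5, 9.0]
scaled = standardize(cap_and_floor(fares, lower=0.0, upper=100.0))
train, val, test = split(scaled)
```

The fixed seed keeps the split reproducible across runs. In a real pipeline you would typically compute the mean and standard deviation on the training split only and reuse them for the other splits, to avoid leaking information from validation and test data.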
Model training and tuning
This component can retrieve prepared features from the feature store or get the prepared dataset (training and validation sets) as input from the previous component.
This component uses a range of pre-set hyperparameters to train the model (using grid-search CV, Neural Architecture Search, or other techniques). It can also train several models in parallel with different sets of hyperparameter values. The trained model is sent to the next component as an artifact.
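A grid search over pre-set hyperparameters can be sketched in a few lines. The example below is an assumption-heavy toy: it uses a hand-written one-dimensional k-nearest-neighbours classifier as a stand-in for a real model, and the names `knn_predict`, `accuracy`, and `grid_search` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Rows = List[Tuple[float, int]]  # (feature value, binary label)


def knn_predict(train: Rows, x: float, k: int) -> int:
    """Toy 1-D k-nearest-neighbours classifier used as a stand-in for a real model."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train: Rows, val: Rows, k: int) -> float:
    return sum(knn_predict(train, x, k) == y for x, y in val) / len(val)


def grid_search(train: Rows, val: Rows, grid: Dict[str, list]) -> Tuple[dict, float, list]:
    """Try every combination in the grid and keep the best validation score."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, accuracy(train, val, **params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# (2.4, 1) is a deliberately noisy training point.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (2.4, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
val = [(2.5, 0), (1.2, 0), (6.5, 1), (7.5, 1)]
best_params, best_score, results = grid_search(train, val, {"k": [1, 3, 5]})
```

Each combination is independent of the others, which is why this step lends itself to running several training jobs in parallel.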
Model evaluation
The trained model is the input for this component and is evaluated on the validation set. You can analyze the results for each model based on metrics such as ROC AUC, precision, recall, and accuracy. The metrics are usually chosen based on the problem, and they are logged for future analysis.
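For a binary classifier, precision, recall, and accuracy all follow from the four confusion-matrix counts. A minimal sketch, with a hypothetical `metrics_log` list standing in for wherever you store metrics:

```python
from typing import Dict, List, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


metrics_log: List[Dict[str, float]] = []  # stand-in for a metrics store

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
metrics_log.append(metrics)
```

Here the model finds two of the four positives (recall 0.5) and two of its three positive predictions are right (precision about 0.67), which shows why a single metric rarely tells the whole story.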
Model analysis and validation
This component:
- 1 Gauges the model’s ability to generalize to unseen data.
- 2 Analyzes the model’s interpretability/explainability to help you understand the quality and biases of the model or models you plan to deploy. It examines how well the model performs on data slices and the model’s feature importance. Is it a black-box model, or can the decisions be explained?
If you train multiple models, the component can also evaluate each model on the test set and provide the option to select an optimal model.
Here, the component will also return statistics and metadata that help you understand if the model suits the target deployment environment. For example:
- Is it too large to fit the infrastructure requirements?
- How long does it take to return a prediction?
- How many resources (CPU usage, memory, etc.) does it consume when it makes a prediction?
If your pipeline is in deployment, this component can also help you compare the trained model’s metrics to the ones in production and alert you if they are significantly lower.
Model packaging and registering
This component packages your model for deployment to the staging or production environments. The model artifacts and necessary configuration files are packaged, versioned, and sent to the model registry.
Containers are one helpful technique for packaging models. They encapsulate the deployed model to run anywhere as a separate scoring service. Other deployment options are available, such as rewriting the deployed code in the language for the production environment. It is most common to use containers for machine learning pipelines.
Model deployment
You can deploy the packaged and registered model to a staging environment (as with traditional software in DevOps) or to the production environment. The staging environment is the first production-like environment, used for integration testing, where models can be tested with the other services in the system that enable the application. For example, you could deploy a recommendation service and test it with the backend server that routes client requests to the service.
Some organizations might opt for staging on a container orchestration platform like Kubernetes. It depends on what tool you are using for pipeline orchestration.
Although not recommended, you can also deploy models that have been packaged and registered directly into the production environment.
Model scoring service
The deployed model serves predictions for client requests in real time (for online systems) or in batches (for offline systems). The predictions are logged to a monitoring service or an online evaluation store to monitor the model’s predictive performance, especially for concept/model drift.
You can adopt deployment strategies such as canary deployment, shadow mode deployment, and A/B testing with the scoring service. For example, you may deploy multiple challenger models with the champion model in production. They will all receive the same prediction requests from clients, but only the champion model will return prediction results. The others will log their predictions with the monitoring service.
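The champion/challenger setup in the example above can be sketched as a thin wrapper around the models. Everything here is hypothetical (the `ShadowScoringService` class, the toy models, and the in-memory `shadow_log` that stands in for a monitoring service); it is only meant to show the request flow.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Dict[str, Any]], int]


class ShadowScoringService:
    """Serve the champion's prediction; log challenger predictions on the side."""

    def __init__(self, champion: Model, challengers: Dict[str, Model]):
        self.champion = champion
        self.challengers = challengers
        self.shadow_log: List[Dict[str, Any]] = []  # stand-in for a monitoring service

    def predict(self, request: Dict[str, Any]) -> int:
        response = self.champion(request)
        for name, model in self.challengers.items():
            try:
                shadow_prediction: Any = model(request)
            except Exception as exc:  # a failing challenger must never affect the client
                shadow_prediction = f"error: {exc}"
            self.shadow_log.append(
                {"model": name, "prediction": shadow_prediction, "champion": response}
            )
        return response  # only the champion's result reaches the client


service = ShadowScoringService(
    champion=lambda r: 1 if r["fare"] > 50 else 0,
    challengers={
        "by_sex": lambda r: 1 if r["sex"] == "female" else 0,
        "needs_cabin": lambda r: 1 if r["cabin"] else 0,  # fails: field is missing
    },
)
response = service.predict({"fare": 80.0, "sex": "male"})
```

Comparing the logged challenger predictions with the champion's over time is what lets you decide whether a challenger is ready to be promoted.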
Performance monitoring and pipeline feedback loop
The final piece in the pipeline is the monitoring component, which runs checks on the data. It also tracks the collected inference evaluation scores (model metrics or other proxy metrics) to measure the performance of the models in production.
Some monitoring components also monitor the pipeline’s operational efficiency, including:
- pipeline health,
- API calls,
- request timeouts,
- resource usage, and so on.
For a fully automated machine learning pipeline, continuous integration (CI), continuous delivery (CD), and continuous training (CT) become crucial. Pipelines can be scheduled to carry out CI, CD, or CT. They can also be triggered by:
- 1 model drift,
- 2 data drift,
- 3 an on-demand request from the data scientist in charge.
Automating your ML pipeline becomes a crucial productivity decision if you run many models in production.
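The three triggers listed above can be expressed as one small decision function that a scheduler or monitoring job could call. This is a sketch under stated assumptions: the `PipelineSignals` fields and the default thresholds are invented for illustration and would need tuning for a real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineSignals:
    live_accuracy: float       # metric measured on recent production traffic
    baseline_accuracy: float   # metric recorded when the model was deployed
    feature_mean_shift: float  # shift of a feature mean, in training std units
    manual_request: bool = False


def retraining_triggers(signals: PipelineSignals,
                        max_accuracy_drop: float = 0.05,
                        max_mean_shift: float = 3.0) -> List[str]:
    """Return the reasons (if any) why the training pipeline should run again."""
    reasons = []
    if signals.baseline_accuracy - signals.live_accuracy > max_accuracy_drop:
        reasons.append("model_drift")
    if signals.feature_mean_shift > max_mean_shift:
        reasons.append("data_drift")
    if signals.manual_request:
        reasons.append("on_demand")
    return reasons


reasons = retraining_triggers(
    PipelineSignals(live_accuracy=0.82, baseline_accuracy=0.90, feature_mean_shift=4.2)
)
```

Returning the list of reasons, rather than a bare boolean, makes it easier to record why a given run was started.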
How to build an end-to-end machine learning pipeline
You build most pipelines in the following sequence:
- 1 Define the code implementation of the component as modular functions in a script or reuse pre-existing code implementations.
- 2 Containerize the modular scripts so their implementations are independent and separate.
- 3 Package the implementations and deploy them on a platform.
Defining your components as modular functions that take in inputs and return output is one way to build each component of your ML pipeline. It depends on the language you use to develop your machine learning pipeline. The components are chained with a domain-specific language (DSL) to form the pipeline.
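Before reaching for any orchestration tool, the idea of components as modular functions that are chained together can be shown in plain Python. This sketch does not use Kubeflow's DSL or any other framework; the step names and the `run_pipeline` helper are hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence


def ingest(source: List[str]) -> List[Dict[str, str]]:
    """Parse raw 'name,age' lines into records."""
    return [dict(zip(("name", "age"), line.split(","))) for line in source]


def preprocess(records: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Cast ages to integers and drop records with a missing age."""
    return [{"name": r["name"], "age": int(r["age"])} for r in records if r["age"]]


def summarize(records: List[Dict[str, Any]]) -> Dict[str, float]:
    """Stand-in for a training step: compute a simple statistic."""
    ages = [r["age"] for r in records]
    return {"n": len(ages), "mean_age": sum(ages) / len(ages)}


def run_pipeline(steps: Sequence[Callable[[Any], Any]], data: Any) -> Any:
    """Chain the components: each step receives the previous step's output."""
    for step in steps:
        data = step(data)
    return data


raw = ["Ann,29", "Bob,", "Cy,41"]
result = run_pipeline([ingest, preprocess, summarize], raw)
```

Because each function only depends on its input, any step can be tested, replaced, or reused in another pipeline on its own. A pipeline DSL adds scheduling, containers, and artifact passing on top of this basic shape.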
See an example of such a script written in a DSL for defining an ML pipeline in Kubeflow Pipeline below:
Packages and containers
You could decide to use a container tool such as Docker or another method to ensure your code can run anywhere.
Orchestration platforms and tools
Pipeline orchestration platforms and tools can help manage your packaged scripts and containers into a DAG or an orchestrated end-to-end workflow that can run the steps in sequence.
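To see what "running steps as a DAG" means in practice, here is a minimal sketch that uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The step names and the `run` and `is_valid_dag` helpers are made up for the example; real orchestrators add retries, caching, and distributed execution on top of the same ordering idea.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set

# Each step maps to the set of steps it depends on.
dag: Dict[str, Set[str]] = {
    "ingest": set(),
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
    "deploy": {"evaluate"},
}


def is_valid_dag(graph: Dict[str, Set[str]]) -> bool:
    """A workflow with circular dependencies can never be scheduled."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


def run(graph: Dict[str, Set[str]], tasks: Dict[str, Callable[[], None]]) -> List[str]:
    """Execute every task only after all of its dependencies have run."""
    order = []
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order


log: List[str] = []
tasks = {name: (lambda n=name: log.append(f"ran {n}")) for name in dag}
executed = run(dag, tasks)
```

The dictionary only declares dependencies; the execution order is derived from it, which is why adding a new step does not require rewriting the rest of the workflow.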
Machine Learning pipeline tools
The following are examples of machine learning pipeline orchestration tools and platforms:
- 1 Metaflow.
- 2 Kedro pipelines.
- 3 ZenML.
- 4 Flyte.
- 5 Kubeflow pipelines.
Metaflow, originally a Netflix project, is a cloud-native framework that couples all the pieces of the ML stack together—from orchestration to versioning, modeling, deployment, and other stages. Metaflow allows you to specify a pipeline as a DAG of computations relating to your workflow. Netflix runs hundreds to thousands of machine learning projects on Metaflow—that’s how scalable it is.
Metaflow differs from other pipelining frameworks because it can load and store artifacts (such as data and models) as regular Python instance variables. Anyone with a working knowledge of Python can use it without learning other domain-specific languages (DSLs).
Kedro is a Python library for building modular data science pipelines. Kedro assists you in creating data science workflows composed of reusable components, each with a “single responsibility,” to speed up data pipelining, improve data science prototyping, and promote pipeline reproducibility.
Learn how you can build ML pipelines with Kedro in this article.
ZenML is an extensible, open-source MLOps framework for building portable, production-ready MLOps pipelines. It’s built for data scientists and MLOps engineers to collaborate as they develop for production.
Learn more about the core concepts of ZenML in the documentation.
Flyte is a platform for orchestrating ML pipelines at scale. You can use Flyte for deployment, maintenance, lifecycle management, version control, and training. You can also use it with platforms like Feast, PyTorch, TensorFlow, and whylogs to do tasks for the whole model lifecycle.
Kubeflow Pipelines is an orchestration tool for building and deploying portable, scalable, and reproducible end-to-end machine learning workflows directly on Kubernetes clusters. You can define Kubeflow Pipelines with the following steps:
Step 1: Write the code implementation for each component as an executable file/script or reuse pre-built components.
Step 2: Define the pipeline using a domain-specific language (DSL).
Step 3: Build and compile the workflow you have just defined.
Step 4: Step 3 produces a static YAML file. Submit that file through the Python SDK for pipelines to trigger a pipeline run.
Kubeflow is notably complex and has slow development iteration cycles, so other K8s-based platforms like Flyte are making it easier to build pipelines. That said, deploying Kubeflow on a cloud-managed service like Google Kubernetes Engine (GKE) can be easier.
DEMO: End-to-end ML pipeline example
In this example, you will build an ML pipeline with Kubeflow Pipelines based on the well-known Titanic ML competition on Kaggle. This project uses machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
The dataset also provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age, and survival.
- In this demo, you will use MiniKF to set up Kubeflow on AWS. Arrikto MiniKF is the fastest and easiest way to get started with Kubeflow. You can also use MiniKF to set up Kubeflow anywhere, including your local computer. You can learn more about how to set up Kubeflow with MiniKF on Google Cloud and your local computer in the documentation.
- If you don’t already have an AWS account, create one.
- Using Arrikto MiniKF via AWS Marketplace costs $0.509/hr as of the time of writing this. The demo takes less than an hour to complete, so you shouldn’t spend more than $3 following this demo.
- This demo uses Arrikto MiniKF v20210428.0.1 and this version installs the following:
The demo steps also work with the latest Arrikto MiniKF v20221221.0.0 at the time of writing this. You can follow this tutorial in the official documentation to learn how to deploy Kubeflow with MiniKF through the AWS Marketplace.
If you have deployed Kubeflow with MiniKF, let’s jump into the Kubeflow dashboard to set up the project:
To get started, click on (1) “Notebooks” and (2) “+NEW SERVER”.
Specify a name for your notebook server:
Leave others as default (depending on your requirements, of course) and click “ADD VOLUME” under the Data Volumes category:
You will now see a new data volume added with the name you specified for your server and “-vol-1/” as a suffix:
You can now launch the notebook server:
This might take a couple of minutes to set up, depending on the number of resources you specified. When you see the green checkmark, click on “CONNECT”:
That should take you to the Jupyterlab launcher, where you can create a new notebook and access the terminal:
When you launch the terminal, enter the following command (remember to enter your data volume name):
(3) Launch the `layer_kubeflow_titanic_demo.ipynb` notebook:
After running the first code cell, restart your kernel so that the changes can take effect in the current kernel:
Kale helps compile the steps in your notebook into a machine learning pipeline that can be run with Kubeflow Pipelines. To turn the notebook into an ML pipeline, (1) click the Kale icon, and then (2) click enable:
Kale will automatically detect the steps it should run and the ones it should skip as part of the exploratory process in the notebook. In this notebook, Kale treats every step as a component because each one takes an input and returns an output artifact.
(1) You can now edit the description of your pipeline and other details. When you are done, (2) click on “COMPILE AND RUN”:
If all goes well, you should see a visual like the one below. Click on “View” beside “Running pipeline…” and a new tab will open:
You should be able to view a pipeline run and see the DAG (Directed Acyclic Graph) of the Kubeflow Pipeline you just executed with Kale through the Pipeline UI:
To see the result your model returned for the serving step, click on the “randomforest” step, go to “Visualizations”, scroll down to the “Static HTML” section, and view the prediction result for the last cell:
In this case, based on the dummy data passed in the serving step for the notebook, the model predicted that this particular passenger would not survive the shipwreck.
You can also get the URL endpoint serving your model by taking the following steps:
Click “Models” in the sidebar and observe that a model is already being served. Observe the Predictor, Runtime, and Protocol. Click on the model name.
You will see a dashboard to view the details of the model you are serving in production.
(1) Monitor your model in production with Metrics and logs to debug errors. You can also see the (2) “URL external” and (3) “URL internal”, the endpoints where you can access your model from any other service or client. The “URL external” can be re-routed to your custom URL.
For now, we will access the model via the terminal through the “URL internal” endpoint. Copy the endpoint and head back to your Jupyterlab terminal. Save the endpoint in a variable with the following command:
You should get the same response as the one from the Pipeline notebook:
Congratulations! You have built an end-to-end Pipeline with Kubeflow.
Challenges associated with ML pipelines
Some challenges you will likely encounter as you work with ML pipelines include the following:
- 1 Infrastructure and scaling requirements.
- 2 Complex workflow interdependencies.
- 3 Workflow scheduling.
- 4 Pipeline reproducibility.
- 5 Experiment tracking.
Infrastructure and scaling requirements
The promise of machine learning pipelines only materializes when you have solid infrastructure to run them on. Companies such as Uber and Airbnb host their own infrastructure and have the budget to build it in-house. This is unrealistic for most others, especially smaller companies and startups that rely on cloud infrastructure to get their products to market.
Using cloud infrastructure to run data, training, and production pipelines can lead to rapidly escalating bills if you don’t monitor them appropriately. You may also encounter situations where different workflow components have significantly different infrastructure needs.
Machine learning pipelines allow you to run experiments efficiently and at scale, but this purpose might be defeated if resources and budget limit you.
Complex workflow interdependencies
Implementing pipeline workflows can be complicated due to the complex interdependence of pipeline steps, which can grow and become difficult to manage.
Scaling complex workflow interdependencies can also be an issue, as some components might require more computational resources than others. For example, model training can use more computing resources than data transformation.
Workflow scheduling dilemma
Scheduling the workflows in a machine learning pipeline and providing resiliency against errors and unplanned situations can be very tricky. When you use a workflow scheduler, it can be difficult to specify all the actions the orchestrator should take when a job fails.
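One common way to make a failure policy explicit is to wrap each job in a retry helper with a bounded number of attempts, a backoff delay, and a final failure hook. The sketch below is a hypothetical helper, not the API of any particular scheduler; `run_with_retries`, `on_failure`, and `flaky_job` are names invented for the example.

```python
import time
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(job: Callable[[], T],
                     max_attempts: int = 3,
                     base_delay: float = 0.0,
                     on_failure: Optional[Callable[[Exception], None]] = None) -> T:
    """Retry a job with exponential backoff; call on_failure once retries run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                if on_failure is not None:
                    on_failure(exc)  # e.g. alert whoever owns the pipeline
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


attempts: Dict[str, int] = {"count": 0}


def flaky_job() -> str:
    """Simulates a step that fails twice with a transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = run_with_retries(flaky_job, max_attempts=3)
```

Retries only help with transient errors. A step that fails deterministically (bad input, a bug) will exhaust its attempts, which is why the final failure hook and the re-raised exception matter as much as the retry loop.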
Pipeline reproducibility
Running tens to hundreds of pipelines at scale, with multiple interconnected stages that may involve various data transformations, algorithmic parameters, and software dependencies, can affect pipeline reproducibility.
Often forgotten, but the infrastructure, code, and configuration that are used to produce the models are not correctly versioned and are in a non-consumable, reproducible state. — Ketan Umare, Co-Founder and CEO at Union.ai, in an AMA session at MLOps.community 2022.
In other cases, you may build your pipelines with specific hardware configurations running on an operating system and varying library dependencies. But when compiling the pipeline to run in a different environment, these environmental differences can impact the reproducibility of machine learning pipelines.
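One lightweight way to notice environmental differences is to record a fingerprint of the environment with every pipeline run and compare it across runs. The sketch below is an assumption-laden illustration: the `environment_fingerprint` function is hypothetical, and the dependency versions passed in are placeholder values (in practice you would read them from your lock file or installed packages).

```python
import hashlib
import json
import platform
from typing import Any, Dict


def environment_fingerprint(dependencies: Dict[str, str],
                            config: Dict[str, Any]) -> Dict[str, Any]:
    """Capture interpreter, OS, dependency versions, and config as one hash."""
    snapshot: Dict[str, Any] = {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
        "dependencies": dependencies,
        "config": config,
    }
    # sort_keys makes the hash independent of dictionary ordering.
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    snapshot["fingerprint"] = hashlib.sha256(payload).hexdigest()
    return snapshot


# Placeholder versions, for illustration only.
run_a = environment_fingerprint({"numpy": "1.24.0", "pandas": "2.0.1"}, {"seed": 42})
run_b = environment_fingerprint({"pandas": "2.0.1", "numpy": "1.24.0"}, {"seed": 42})
run_c = environment_fingerprint({"numpy": "1.25.0", "pandas": "2.0.1"}, {"seed": 42})
```

If two runs produce different results and their fingerprints differ, the snapshot tells you where to start looking. Containerizing the steps attacks the same problem from the other side by pinning the environment itself.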
Best practices for building ML pipelines
From sifting through community conversations to talking to engineers from companies like Brainly and Hypefactors to distilling top learnings from Netflix, Lyft, Spotify, and so on, learn some of the best practices for building ML pipelines below.
Track your machine learning pipelines
We automatically attach an experiment tracker to every pipeline we launch without our users noticing. For us, this ensures at least a minimum set of parameters being tracked… In principle, we see experiment tracking as a tool that should be used with the pipeline. We recommend using a pipeline to track your experiments—that’s how you’ll ensure they are reproducible. — Simon Stiebellehner, MLOps Lead Engineer and MLE Craft Lead at TMNL, in “Differences Between Shipping Classic Software and Operating ML Models” on MLOps LIVE.
You want to leverage techniques and technologies to make your pipeline reproducible and debuggable. This involves exploring practices, including:
- Version control – to manage dependencies, including code, data, configuration, library dependencies, pipeline metadata, and artifacts, allowing for easy tracking and comparing pipeline versions.
- Implementing system governance. Depending on the steps in your pipeline, you can analyze the metadata of pipeline runs and the lineage of ML artifacts to answer system governance questions. For example, you could use metadata to determine which version of your model was in production at a given time.
- Using dedicated tools and frameworks that support tracking and management of pipelines, such as neptune.ai or MLflow, can provide comprehensive tracking and monitoring capabilities.
The tracking tools allow you to:
- log experiment results,
- visualize pipeline components,
- document details of the steps to facilitate collaboration among team members,
- monitor pipeline performance during execution, making it easier to track the evolution of the pipeline over time,
- manage the pipeline’s progress.
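Returning to the governance question mentioned earlier (which model version was in production at a given time), the lookup itself is simple once deployment metadata is recorded. This sketch assumes a hypothetical, time-sorted deployment log; the dates and version labels are invented.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical deployment metadata, sorted by timestamp: (deployed_at, model_version).
deployments: List[Tuple[datetime, str]] = [
    (datetime(2023, 1, 10, 9, 0), "v1"),
    (datetime(2023, 2, 3, 14, 30), "v2"),
    (datetime(2023, 3, 21, 8, 15), "v3"),
]


def version_in_production(log: List[Tuple[datetime, str]], at: datetime) -> Optional[str]:
    """Return the most recent version deployed at or before `at`, if any."""
    timestamps = [deployed_at for deployed_at, _ in log]
    index = bisect_right(timestamps, at)
    return log[index - 1][1] if index else None


answer = version_in_production(deployments, datetime(2023, 2, 1))
```

The same pattern works for any lineage question that boils down to "what was the state at time T", provided the pipeline writes the metadata in the first place.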
ReSpo.Vision uses ML in sports data analysis to extract 3D data from single-view camera sports broadcast videos. They run a lot of kedro pipelines in the process.
Wojtek Rosiński, CTO at ReSpo.Vision says: “When we use Neptune with kedro, we can easily track the progress of pipelines being run on many machines, because often we run many pipelines concurrently, so comfortably tracking each of them becomes almost impossible. With Neptune, we can also easily run several pipelines using different parameters and then compare the results via UI.”
Below, you can see an example of what it looks like in the Neptune UI.
Compose your pipeline components into smaller functions
Use pipelining tools and the SDK to build your pipeline with reusable components (defined as small functions). See an example that follows the ZenML pipeline workflow:
This way, you can implement your workflow by building custom components or reusing pre-built ones. This can make it easier and quicker to build new pipelines, debug existing ones, and integrate them with other organizational tech services.
Do not load things at the module level; this is often a bad thing. You don’t want the module load to take forever and fail. — Ketan Umare, Co-Founder and CEO at Union.ai, in an AMA session at MLOps.community 2022.
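One way to follow this advice is to defer expensive loading until a step actually needs it, and cache the result. A minimal sketch using `functools.lru_cache` from the standard library; the `get_model` and `predict_step` functions and the toy "model" dictionary are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List

load_stats: Dict[str, int] = {"loads": 0}


@lru_cache(maxsize=1)
def get_model() -> Dict[str, object]:
    """Load the (hypothetical) model on first use and cache it for later calls."""
    load_stats["loads"] += 1
    # Stand-in for an expensive operation such as reading weights from disk.
    return {"weights": [0.2, 0.8], "bias": 0.1}


def predict_step(features: List[float]) -> float:
    model = get_model()  # loaded here, inside the step, not at import time
    weights: List[float] = model["weights"]  # type: ignore[assignment]
    return sum(w * x for w, x in zip(weights, features)) + model["bias"]  # type: ignore[operator]


first = predict_step([1.0, 2.0])
second = predict_step([0.0, 0.0])
```

Importing this module costs nothing and cannot fail because of a missing model file. The load happens once, on the first prediction, and any error surfaces inside the step where the orchestrator can report it.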
Below is another example of a step defined as a function with the Prefect orchestration tool:
Write pipeline tests
Another best practice is to ensure you build a test suite that covers each aspect of your pipeline, from the functions that make up the components to the entire pipeline run. If possible (and depending on the use case), be willing to automate these tests.
To guarantee that models continue to work as expected during continuous changes to the underlying training or serving container images, we have a unique family of tests applicable to LyftLearn Serving called model self-tests. — Mihir Mathur, Product Manager at Lyft, in “Powering Millions of Real-Time Decisions with LyftLearn Serving” blog 2023.
Composing your pipeline components into smaller functions can make it easier to test. See an example from Lyft’s model self-tests where they specified a small number of samples for the model inputs and expected outputs in a function called `test_data`:
Write your tests locally because, in most cases where your stack and setup make local testing impossible, your users will likely end up testing in production. Containerizing your steps can make testing your pipelines locally or in another environment easier before deploying them to production.
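The idea of a model that carries a few sample inputs and expected outputs can be sketched as follows. This is not Lyft's code: the `SurvivalModel` class, its prediction rule, and the `run_self_test` helper are hypothetical, and only the `test_data` name is borrowed from the description above.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, float], int]


class SurvivalModel:
    """Hypothetical model wrapper that carries its own smoke-test samples."""

    def predict(self, features: Dict[str, float]) -> int:
        return 1 if features["fare"] > 50 or features["is_female"] == 1 else 0

    @property
    def test_data(self) -> List[Sample]:
        # A handful of inputs with the outputs the model is expected to return.
        return [
            ({"fare": 80.0, "is_female": 0}, 1),
            ({"fare": 7.25, "is_female": 0}, 0),
            ({"fare": 7.25, "is_female": 1}, 1),
        ]


class BrokenModel(SurvivalModel):
    """Simulates a regression, e.g. after a dependency or image change."""

    def predict(self, features: Dict[str, float]) -> int:
        return 0


def run_self_test(model: SurvivalModel) -> List[str]:
    """Return a description of every sample whose prediction is not as expected."""
    failures = []
    for features, expected in model.test_data:
        actual = model.predict(features)
        if actual != expected:
            failures.append(f"{features}: expected {expected}, got {actual}")
    return failures


healthy_failures = run_self_test(SurvivalModel())
broken_failures = run_self_test(BrokenModel())
```

Because the check needs nothing but the model object, it can run locally, in CI, and again when the serving container starts.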
What pipeline tests should you write? Eugene Yan, in his article, listed a scope map for what effective pipeline tests should look like, including unit tests, integration tests, functional tests, end-to-end tests, and so on. Check out the extensive article.
Building end-to-end machine learning pipelines is a critical skill for modern machine learning engineers. By following best practices such as thorough testing and validation, monitoring and tracking, automation, and scheduling, you can ensure the reliability and efficiency of pipelines.
With a solid understanding of each pipeline stage’s components, structure, and challenges, you can build robust and scalable pipelines that streamline your ML workflow.
- Intro to Kubeflow Pipelines – YouTube
- Building and Managing Data Science Pipelines with Kedro – neptune.ai
- How Metaflow Became Netflix’s Beloved Data Science Framework • Julie Amundson
- Flyte: MLOps Simplified – MLOps Community