In this article, I’m going to compare Kedro, Metaflow, and ZenML, but before that, I think it’s worth taking a few steps back. Why even bother using ML orchestration tools such as these three? It is not that hard to start a Machine Learning project. You install some Python libraries, instantiate the model, train it, and voilà! It is ready. There are also a ton of people out there telling you that this is data science, and that typing model.fit in your locally running Jupyter Notebook will guarantee you your dream job and solve all problems in the universe. But that is just one small part of the truth.
Just to make myself clear, there is nothing wrong with experimenting and reporting with Jupyter Notebooks. The problem is that they can’t provide a software development framework that can be used in most of the industry ML projects.
Implementing machine learning projects (and creating ML-based products) is as complex as any other software development. For instance, in real-world ML projects, you must be able to:
- Ingest the data in multiple formats
- Pre-treat the data, clean the datasets, etc.
- Do feature engineering
- Train your models
- Test them
- Deploy your model
- Implement CI/CD
- Track your code using git
- Track your models’ metrics
And there is a lot more that is impractical to do using notebooks alone (if someone says otherwise, they are probably misleading you). Thus, your code must be maintainable, scalable, trackable, and workable for a team of more than one person.
That’s why data scientists must learn Machine Learning orchestration tools, and this post aims to be a deep dive into some of the most used MLOps orchestration tools available – Kedro, Metaflow, and ZenML. We are going to take a look into their:
- Main purposes
- Code structure
- Main features
- And pros and cons
We are also going to show:
- How easy it is to (and also how to) get started with each one of them
- And show their real-world capabilities
Okay… but why specifically those 3 – Kedro, Metaflow, and ZenML?
There are a lot of frameworks implemented in order to solve the issues raised before. We’ve chosen Kedro, Metaflow, and ZenML because they are some of the most used tools to orchestrate Machine Learning pipelines:
- They have free-to-use licenses: you won’t pay a high bill to structure your ML project
- They are widespread: there are a lot of people out there using them, making it easy to find people to help you when needed
- They have lots of learning material online: This way you (and the following people that join your team) can jump into and start your first project or maintain the existing ones
- They have many integrations with the most used tools in the industry: you will be able to integrate them in an existing cloud framework or even expand it to do more stuff
Thus, those are the easiest ways to get your Machine Learning out of your Jupyter Notebook and make it face the real world to solve real problems for real people.
Got it. And now, which one should I choose – Kedro, Metaflow, or ZenML?
There are many factors to be taken into consideration when choosing an orchestration tool. It is going to be the basis of your machine learning project. This choice will influence, for example:
- The time needed to set a project up
- The time needed to change specific parts of the project (such as a data source, a code snippet, etc)
- Time to train new employees to understand running projects
- Costs of the project that include (but are not limited to) cloud, processing power, and human labor
That’s why it is so important to pay attention to which orchestration tool you use. And that’s what we are going to help you find out.
If you don’t have time to read it all, you can jump right here for a quick summary.
Now, let’s dive in!
At first, let’s have a glance at how each documentation defines its tool:
“Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning. Kedro is hosted by the LF AI & Data Foundation.”
It is available at this link.
“Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost the productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.”
It is available at this link.
By this point, everything looks very similar. But how do these tools differ from each other? And yes, there are important differences to pay attention to.
Who built it?
All three are open-source free-to-use frameworks, but all were developed by companies:
- Kedro was created by QuantumBlack, a McKinsey company, and was recently donated to the Linux Foundation
- ZenML, on the other hand, is developed by a small company based in Munich, Germany
- Metaflow was developed within Netflix
Are they well documented?
All 3 tools have documentation that can be used as a basis for their implementation. Documentation is especially important in software development because it is a guide written by the people who actually created the software, so there is hardly a better medium to explain how to use it.
It gets even more important when maintaining a running project, because there is always a need to check specific details and code samples for a certain part of the framework, such as its syntax or acronyms.
Kedro has extensive documentation, which contains tutorials and example projects. Each part of the software has a section of its own with code snippets and samples, from the simplest features to the advanced ones. You can install a template project, start your first project using sample datasets, and train your first model with a few commands in your terminal.
ZenML documentation is also very complete. Everything is well written and there are useful images that help us to understand what is happening there. There are also copy-paste tutorials to hit the ground running and see ZenML work in practice. In my humble opinion, though, it lacks code snippets to demonstrate how to apply some concepts.
Metaflow docs also have plenty of use cases and examples. It includes some specific examples of each step of the usage, both locally and in a production environment. Additionally, there is a blog maintained by Metaflow’s project team, so it can be considered official documentation. This blog contains more examples and code samples. As the blog is not structured documentation, it can be hard to build the first project step by step.
Metaflow is the only one of three that can be run in R, so if this is your preferred coding language, maybe your choice of tool is already made.
Documentation: Kedro = ZenML = Metaflow
Now it’s time to understand how these tools perform in a more practical way. We are going to see how to learn them and how they can be used in real-world projects.
Learning how to use
A software tool is only as useful as it is easy to learn and use. Let’s take a look at some of the available materials for each tool.
Kedro has a large community, which also provides lots of information on how you can create projects with it. In addition to the official documentation (which comes with a template project), there is a full Kedro course available for free on YouTube, hosted by the DataEngineerOne channel. It goes from the very beginner level (like installing Kedro) to most of the advanced features. There are also 4 step-by-step posts (written by me) on Medium showing how to deploy a model using Kedro, MLflow, and FastAPI (part 1, part 2, part 3, and part 4).
There are a lot of ZenML videos on YouTube, and many of them are published by the official ZenML profile, which you can find here. There are also some ready-made tutorial projects, which can be downloaded from the command-line interface itself and used as guides to implement your own project; the how-to is here.
Basically, you can run `zenml example list` and pick one of the available examples to download into your current directory by typing `zenml example pull quickstart`.
Metaflow has fewer videos and courses than its competitors. However, the Outerbounds blog provides plenty of reading material about the tool and its environment, including examples and code samples. There are some videos available on YouTube showing some implementations, like this one: Metaflow: The ML Infrastructure at Netflix, but as of the day this post was written, there seem to be no full courses available. The material is also split, because Metaflow is the only one here that can be run in both R and Python, so there is learning material illustrating implementations in each of them.
Learning how to use: Kedro = ZenML = Metaflow
Kedro: nodes and pipelines
Kedro has a code structure defined by nodes and pipelines. Every node is a Python function. Pipelines are the orchestration of the nodes, specifying all inputs, outputs, and parameters of your pipeline. A pipeline basically describes which data goes into each node, what the outputs are, and what the next steps are.
There is no need to specify which node comes before the next one. Kedro automatically works out the execution order of your whole pipeline, as long as there are no conflicts in the execution (such as a data source being both an input and an output of the same node). This way, if your data structure remains the same, your sources may vary without the need to change the main code. Also, it allows you to see what is happening in each step of the pipeline, since the intermediate data is stored.
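To make the idea of automatic ordering concrete, here is a minimal sketch in plain Python. It is not Kedro's implementation and does not use Kedro's API: the node names, the `NODES` dictionary, and the `execution_order` helper are all hypothetical. It only shows how an execution order can be derived from declared inputs and outputs, using the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter

# Each hypothetical node declares the datasets it reads and the datasets it writes.
# The nodes are deliberately listed out of order.
NODES = {
    "train_model": {"inputs": ["features"], "outputs": ["model"]},
    "clean_data": {"inputs": ["raw_cars"], "outputs": ["clean_cars"]},
    "build_features": {"inputs": ["clean_cars"], "outputs": ["features"]},
}


def execution_order(nodes):
    """Return node names ordered so every input is produced before it is consumed."""
    # Map each dataset to the node that produces it.
    producer = {}
    for name, spec in nodes.items():
        for dataset in spec["outputs"]:
            producer[dataset] = name

    # A node depends on the producers of its inputs; raw inputs have no producer.
    graph = {
        name: {producer[d] for d in spec["inputs"] if d in producer}
        for name, spec in nodes.items()
    }
    # static_order() raises graphlib.CycleError if the declarations are circular.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(NODES))
```

Because the order comes from the data dependencies, adding a new node only requires declaring what it reads and writes; nothing else in the listing has to be reshuffled.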
All the code is located inside the “src” folder, and all functions are created inside node and pipeline Python files. It is possible (and highly recommended) to split your code into subfolders (such as preprocessing, feature engineering, model training, etc.) to keep it organized. This way, it is possible to run each step of the pipeline individually. The whole code architecture, including the code development lifecycle, can be seen here.
Kedro also provides an environment interface. Kedro environments are meant to separate the configuration used within the pipeline (such as data sources and parameters). This way, you can separate development from production, or even create other partitions within your projects. You can read more about it here. Kedro automatically creates the local and base environments, but you can create as many as you need.
When it comes to adding additional features to particular nodes (or all of them), kedro uses a Hooks structure. For example, you can configure a particular behavior like logging the time when a dataset loading occurs. The hooks’ structure can be found here. Kedro also has a secret management structure that uses a credentials yaml file, and you can read more about it here.
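The hook idea can be sketched without Kedro at all. In the toy example below, the `TimingHooks` class, its method names, and the `load_with_hooks` helper are assumptions made for illustration; in a real Kedro project the framework (not your code) calls the hooks, and the exact hook names and registration steps should be taken from the Hooks documentation.

```python
import time


class TimingHooks:
    """Hypothetical hook class that records how long each dataset takes to load."""

    def __init__(self):
        self.started = {}
        self.durations = {}

    def before_dataset_loaded(self, dataset_name):
        self.started[dataset_name] = time.perf_counter()

    def after_dataset_loaded(self, dataset_name, data):
        elapsed = time.perf_counter() - self.started.pop(dataset_name)
        self.durations[dataset_name] = elapsed


def load_with_hooks(dataset_name, loader, hooks):
    """Toy stand-in for the framework: call every hook around the real load."""
    for hook in hooks:
        hook.before_dataset_loaded(dataset_name)
    data = loader()
    for hook in hooks:
        hook.after_dataset_loaded(dataset_name, data)
    return data


timing = TimingHooks()
cars = load_with_hooks("cars", lambda: [{"brand": "Fiat"}], [timing])
print(cars, timing.durations)
```

The loading code itself stays untouched; the extra behavior lives entirely in the hook class, which is what makes it easy to add to one dataset or to all of them.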
ZenML: steps and pipelines
ZenML is built on steps. It defines a decorator for them (@step), and each step of the pipeline is written as a Python function. A step contains the code for ingesting data, cleaning data, training a model, or even evaluating metrics. The steps are organized using pipelines, which chain all the steps together so that the whole pipeline can run at once. The main advantage of using ZenML’s built-in classes for steps, inputs, and outputs is that all integrations get easier.
The pipelines, on the other hand, are the Python functions that specify the steps’ execution order, with their corresponding inputs and outputs. All code is built in the run.py file, which contains the steps and the pipeline itself.
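The division of labor between steps and pipelines can be sketched in plain Python. Note that the `step` decorator below is a local stand-in written for this example, not the one ZenML ships, and the step names and the "training" logic are hypothetical.

```python
STEP_REGISTRY = []


def step(func):
    """Local stand-in for a step decorator: registers the function, leaves it callable."""
    STEP_REGISTRY.append(func.__name__)
    return func


@step
def ingest():
    return [1.0, 2.0, 3.0, 4.0]


@step
def clean(values):
    return [v for v in values if v > 1.0]


@step
def train(values):
    # "Training" here is just computing a mean, to keep the sketch dependency-free.
    return sum(values) / len(values)


def training_pipeline():
    """The pipeline fixes the execution order by passing outputs to the next step."""
    raw = ingest()
    cleaned = clean(raw)
    return train(cleaned)


print(training_pipeline())
```

Each step stays a small, testable function, while the pipeline function is the only place that knows how the steps connect.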
ZenML project infrastructure also defines a Stack which defines how your pipeline will run. It normally includes:
- An Orchestrator: the framework that orchestrates the pipeline execution (such as Airflow and Kubeflow)
- An Artifact Store: the actual storage of the data, mostly for data for intermediate pipeline steps (since ingestion usually is a step of its own)
- A Metadata Store: a store to keep track of the data regarding the pipeline run itself
Because of its structure, changing components of the stack is easier, like the data storage, cloud infrastructure or orchestration. There is also an explanation of each stack component available here. You can use, for example, a local development stack (that runs in your machine) and a production stack, that runs on a cloud server.
The stacks are particularly useful to use in projects where redundancy is important. If your main system fails (such as depending on a particular cloud provider), you can switch to a different infrastructure with little to no interaction, or even automate it. It can also include an artifact registry, which can even manage the project’s secrets.
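Here is a rough sketch of why a stack makes infrastructure swappable. Everything in it (the `Stack` and `InMemoryStore` classes and the `run_pipeline` function) is hypothetical and much simpler than ZenML's real stack components; the point is only that the pipeline code receives the stack and never hard-codes where things are stored.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical artifact store; a real one might write to disk or object storage."""
    label: str
    artifacts: dict = field(default_factory=dict)

    def save(self, name, value):
        self.artifacts[name] = value


@dataclass
class Stack:
    """Bundles the infrastructure choices; the pipeline only sees this object."""
    name: str
    artifact_store: InMemoryStore


def run_pipeline(stack):
    # The pipeline logic never mentions where artifacts end up.
    total = sum([1, 2, 3])
    stack.artifact_store.save("total", total)
    return f"ran on {stack.name}"


local = Stack("local", InMemoryStore("local-disk"))
cloud = Stack("cloud", InMemoryStore("object-storage"))
print(run_pipeline(local), run_pipeline(cloud))
```

Switching from the local stack to the cloud one is a change of argument, not a change of pipeline code.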
Unlike Kedro, ZenML allows you to customize your whole pipeline creation, thus there are some best practices that we strongly recommend following. You can find the guide here.
Metaflow: the one with the UI (and also steps)
Metaflow is very similar to Airflow, as it’s based on DAGs (directed acyclic graphs), so it can be easier for Airflow users to understand it.
Just like ZenML, Metaflow also uses a step decorator on functions, so it will feel familiar to users with ZenML experience. However, the flow is specified within the steps themselves (each step names the step that follows it), which may seem a little confusing. Also, in order to create a flow, you should first create a class (MyFlow, for example) that is going to contain its steps.
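If the "flow inside the steps" idea sounds abstract, the toy sketch below may help. It does not import Metaflow: `ToyFlow` is a hypothetical base class written for this example, and it has none of the real library's features (no step decorator, no branching, no persistence). It only shows the pattern of a class whose steps each point to their successor.

```python
class ToyFlow:
    """Minimal stand-in for a flow base class: runs steps until none is queued."""

    def next(self, step_method):
        self._next = step_method

    def run(self):
        self.trace = []
        current = self.start
        while current is not None:
            self._next = None
            self.trace.append(current.__name__)
            current()
            current = self._next
        return self.trace


class MyFlow(ToyFlow):
    def start(self):
        self.numbers = [3, 1, 2]
        self.next(self.sort_numbers)  # the step itself names its successor

    def sort_numbers(self):
        self.numbers = sorted(self.numbers)
        self.next(self.end)

    def end(self):
        self.largest = self.numbers[-1]


flow = MyFlow()
print(flow.run(), flow.largest)
```

Notice that data is passed between steps as attributes on `self`, and that reading the execution order means following the `self.next(...)` calls from step to step.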
Most of the functionalities are built with decorators. For instance, you can use a resources decorator to specify computational resources (such as memory, CPU, and GPU usage), the Python environment that should be used for each step, the number of retries, the user responsible, et cetera.
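To illustrate how a decorator can attach behavior such as retries to a step, here is a small plain-Python sketch. The `retry` decorator and the `flaky_download` function are written for this example and are not Metaflow's own decorators; treat the parameter name `times` as an assumption as well.

```python
import functools


def retry(times):
    """Toy decorator: re-run the wrapped function up to `times` extra times on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # out of retries: surface the last error
        return wrapper
    return decorator


calls = {"count": 0}


@retry(times=2)
def flaky_download():
    # Fails on the first two calls, succeeds on the third.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "data"


result = flaky_download()
print(result, calls["count"])
```

The function body contains no retry logic at all; the policy is declared on top of it, which is the appeal of the decorator-based style (and also why a step with many stacked decorators can become hard to read).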
The main advantage of using Metaflow is that it has a built-in UI for tracking metrics of each run within the project. A sample UI can be found here.
Metaflow automatically tracks some metrics such as the user that ran the experiment, the time it took, the steps it had, and running status.
Code structure: Kedro > ZenML > Metaflow
Kedro: the data catalog and the parameters file
Kedro has a nice way to abstract the data sources out of the code: the data catalog files. These are YAML files where all sources (input and output data) are described, with information regarding their saving path, format, and other arguments. This way they are passed to the pipeline as parameters, which makes the update a lot easier. The data can be stored in multiple formats, multiple paths, and with many different parameters.
You can read more about it here. This YAML sample shows how to load a locally stored CSV dataset.
```
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
```
The main advantage of passing the datasets of your project as parameters is that if you have to change a data source (its path, source, or file) you won’t need to update all your code to match it. The Catalog is a default kedro class, and there are several integrations ready to use. In the worst-case scenario, the data cleaning nodes would need to be updated and your intermediate steps of the pipeline won’t change.
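The benefit is easier to see in code. The sketch below is a deliberately tiny, hypothetical catalog written in plain Python (the `CATALOG` dictionary, the `load` helper, and the inlined CSV text are all assumptions, and none of it is Kedro's API). The processing function asks for a dataset by name and never learns where it lives or how it is delimited.

```python
import csv
import io

# Hypothetical catalog: dataset names map to a loader type and its arguments.
CATALOG = {
    "cars": {
        "type": "csv",
        "source": "brand;price\nFiat;9000\nAudi;30000\n",
        "load_args": {"delimiter": ";"},
    },
}


def load_csv(source, **load_args):
    # The text is inlined so the sketch runs anywhere; a real entry would hold a path.
    return list(csv.DictReader(io.StringIO(source), **load_args))


LOADERS = {"csv": load_csv}


def load(name):
    """Look the dataset up by name; the caller never sees its location or format."""
    entry = CATALOG[name]
    return LOADERS[entry["type"]](entry["source"], **entry.get("load_args", {}))


def average_price(cars):
    return sum(float(row["price"]) for row in cars) / len(cars)


print(average_price(load("cars")))
```

If the source later switches to a comma-separated file, only the catalog entry changes; `average_price` and every other function that consumes `cars` stay exactly as they are.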
And when I say it can have any format, I mean it: it can be a piece of SQL code (from multiple sources and clouds), a CSV table, a Parquet one, a Spark dataset, an MS Excel table, a JPEG image, etc. It can be located in the cloud (AWS, GCP, and Azure), in a Hadoop File System, or even on an HTML website. The catalog is also used to save the models and the metrics files, making it easier to implement them and manage all sources.
Kedro also uses YAML files for parameters. A parameters file may contain all the parameters your models or pipeline need. This way, it is easy to track and change them when required.
As said before, ZenML is based on steps. So, for instance, data ingestion is simply something you can write as a python script inside a step.
Most of the power of ZenML comes from its integrations. It provides plenty of tools to integrate data ingestion from multiple sources, as well as feature stores, artifact stores, and so on. We are going to touch on ZenML integrations further in the next section.
Because ZenML is highly customizable, you may create your own code structure including parameters and configuration files, in order to make your project look a little more organized (and a little less hard coded) and easier to maintain.
Data ingestion in Metaflow is quite similar to ZenML’s: you write it as code inside a step. However, Metaflow’s built-in integrations focus mainly on AWS, which can be an obstacle when working with a different cloud provider. So, basically, in order to integrate other data sources and features, you might have to do some hard coding in Python (or R).
Ingesting data: Kedro > ZenML > Metaflow
Most of Kedro’s capabilities come from its plugins. They expand what Kedro can do by running third-party tools natively, and they can all work together to their full potential. And while Kedro adds them as plugins, each one actually does its job as an independent component integrated into the Kedro structure. There are plenty of them, such as:
- Kedro-Neptune: a plugin that connects Kedro to neptune.ai, which allows experiment tracking and metadata management
- Kedro-mlflow: allows you to use MLflow for model tracking and model serving
- Kedro-viz: an amazing tool that allows visualizing your whole ML pipeline
- Kedro-docker: a tool that makes it easier to create, deploy, and run Kedro ML applications using Docker
The main advantage of the plugins is that all of them can be easily installed from PyPI, and they are compatible with the Kedro structure, so it is easy to get them up and running in an existing project. You can look in the official documentation and find the one that is most useful for solving your problems.
ZenML supports a ton of third-party integrations. They cover orchestration tools, distributed processing tools, cloud providers, deployment, feature stores, monitoring, and so on. ZenML works as the glue holding the whole pipeline together (as its creators put it). These tools are called inside the ZenML Python files in a very similar way to how you would call them outside ZenML, for instance, using `zenml.integrations.sklearn` to import sklearn.
In ZenML there is a Python integrations package, which contains the interfaces for a lot of different tools. You can read more about the already integrated tools (and the ones they are currently working on) here. Installing an integration is only a four-word command away, such as `zenml integration install sklearn`.
As is well known, Netflix has a strong relationship with AWS. Thus, Metaflow is well integrated with most of the AWS services: S3 for storage, AWS Batch and Kubernetes for computation, AWS Fargate and RDS for metadata storage, AWS SageMaker Notebooks, and AWS Step Functions, EventBridge, and Argo Workflows for scheduling. You can read more about it here.
Integrations: Kedro = ZenML > Metaflow
All tools we’ve been talking about can be used to deploy models to production and orchestrate their training and evaluation. However, some of their features may fit better for certain use cases.
Kedro has a highly organized code structure and abstracts the data and model sources and paths. This makes it a good choice for projects that are built by large teams and need to be maintained over a long time, since new features based on plugins and hooks can be added to the project without changing the entire pipeline structure. Also, as it can work seamlessly with PySpark, there is no problem working with big data or with any cloud platform for deployment.
ZenML has a more customizable code structure than Kedro, and it may be suited well for a prototype project. One of the main advantages of ZenML is that it can work with different stacks without changing the code. So, if the project needs to change the cloud provider, for example, due to unavailability, it can be done in seconds without much setup. This provides robustness and scalability.
Metaflow is the only one out of three that can work with R. Thus, if there is some model not yet in Python, Metaflow may make it possible to deploy it to production using R. Also, another advantage of Metaflow is that it tracks every run metric automatically. Then it may be a good fit for prototyping and testing different approaches to projects. Also, owing to its compatibility with AWS, it should be your pick if you have some dependence on AWS.
Advantages and limitations
- Kedro’s structure is particularly useful to maintain order in complex projects, which would be difficult with other tools.
- Its plugins provide a wide range of operations that cover what Kedro does not do natively.
- Strong community: community developments are frequent, so there are always bug fixes and new plugin releases.
- Node and pipeline architecture avoids code repetition, as it allows function reuse and thus clean coding.
- It is possible to use different environments, such as dev and prod, to separate the data and model sources.
- Because it is so structured, Kedro can be overkill if your project is too simple.
- ZenML is quite simple and easy to implement. You can start your first project really fast and grow it into a complex one with time.
- It has lots of integrations with third-party tools, which allows you to implement a broad range of projects.
- The Stack structure allows you to implement your code in many different ways and makes it possible to keep dev and prod separated.
- As the code is simple, it can be tricky to keep everything organized as the project grows. It will require discipline to keep things tidy in projects with multiple steps and complicated data flows. Thus, it is really important to read and use the proposed best practices.
- It is held by a company, thus there is less community involvement in bug fixing and in creating new features.
- Metaflow automatically tracks metrics and pipeline running info, which is quite useful when running in production.
- Available in R: it is one of the few frameworks, if not the only one, that can run in R.
- It has a built-in user interface to display all the metrics tracked.
- Based on DAGs: this can make Metaflow particularly useful for those familiar with graph-based tools such as Airflow.
- Lacks documentation and additional online learning material.
- Code can easily get confusing with the use of many decorators.
- Similarly to ZenML, it can be tricky to keep everything organized as the project grows.
- It may be a little hard to learn for beginners.
Kedro vs ZenML vs Metaflow: summing it all up!
Note: the comparison was last updated in January 2023. If you see any out-of-date information, feel free to contact us at firstname.lastname@example.org.
After looking at all these methodologies, it is clear that there are strong differences between each tool. Each one has its own advantages and limitations, and it can suit better or worse each project’s needs. In short:
- Kedro has some interesting capabilities, and it’s especially good when working on complex projects. It abstracts the data sources and the models, and can integrate well with many other tools. It can be set up to suit most modern Machine Learning projects, including all current industry best practices. Its way of abstracting data and model sources is, in my humble opinion, just amazing.
- ZenML is also a strong competitor, with some advantages on its side in comparison to Kedro. It is more customizable and can work with multiple stacks without the need for any code changes, and it also integrates well with lots of other tools. However, it lacks some functionalities, mostly when dealing with data and model sources.
- Metaflow is by far the least structured of the tools we have seen here. It focuses mostly on orchestrating the steps and tracking the step execution metrics. So, every other feature your project needs might have to be coded by hand using other Python functionalities. It is not impossible, but it certainly has some disadvantages. Yet, the automatic metric tracking and UI are pretty cool in production environments.
To finish, I hope that you now have more information to choose which one of these tools suits your use case better. Just keep in mind that there is always a tradeoff between each tool’s capabilities. Thank you for reading this blog till the end, and see you next time!