Data is at the heart of every machine learning model. Your model is only as good as the data it's trained on, and without data you can't build a model at all. So it makes sense to train your models only on accurate, authentic data.
In the course of running their operations, many organizations arrive at a point where they need to improve data tracking and transferring systems. This improvement might lead to:
- finding errors in the data,
- implementing effective changes to reduce risk,
- creating better data mapping systems.
Some writers say that data is the new oil. And just as oil must be refined into fuel before it can power your car, data has to go through a process before raw records become a model component or even a simple visualization.
Data scientists, data engineers, and machine learning engineers rely on data to build correct models and applications, so it helps to understand the journey data takes before it can be used to build accurate models.
This concept of a data journey actually has a name: data lineage. In this article, we'll explore what data lineage means in machine learning and look at several paid and open-source data lineage tools.
What is Data Lineage in Machine Learning?
Data lineage in machine learning describes the journey of data from collection to usage: understanding, recording, and visualizing the changes and transformations the data goes through before its final consumption. It's the detailed record of HOW the data was transformed, WHAT exactly was transformed, and WHY it was transformed.
In machine learning, a model is usually a compressed version of the data it’s been trained on. When you know the data lineage of a dataset, it’s easier to achieve reproducibility.
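A simple, concrete piece of lineage metadata is a content hash of the training dataset: record it alongside each model, and you can later verify that the model was trained on exactly the data you think it was. A minimal sketch in plain Python (the chunked read is just so large files don't need to fit in memory):

```python
import hashlib

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return an md5 hex digest of a dataset file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Store the fingerprint with the model's metadata; if the recorded hash
# matches the hash of the file you have, that link in the lineage is intact.
```

If the hash on record and the hash of the current file disagree, you know the data changed somewhere between training and now, which is exactly the question lineage tooling is built to answer.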
Criteria for choosing Data Lineage tools
Data lineage tools help you visualize and manage the whole journey of your data. When choosing a data lineage tool, there are some core features you should look for:
- Traceability: the ability to trace and verify the history of data. It helps you ensure that you're working with high-quality data.
- Immutability: immutability brings trust to data lineage tools. It means you can always get back to previous versions of your dataset after making changes.
- Open-source: open-source data lineage tools have the advantage of being free to use and constantly improved.
- Integration: many stages and tools are involved in a data journey, from storage to different transformations (wrangling, cleaning, ingestion, etc.). Because of this, data lineage tools should be able to integrate easily with third-party applications.
- Versioning: a good data lineage tool should keep track of different versions of the data and model changes as they happen over various transformations and tuning.
- Collaboration: for remote data science teams, it’s important to collaborate on shared data. Also, it’s important to know who made changes to the data and why the changes were made.
- Metadata Store: a data lineage tool should include a metadata store.
- Big Data Handling: many machine learning models require big data, so data lineage tools should be able to handle and process big data efficiently.
Now, let’s look at some of the best data lineage tools in machine learning.
Data Lineage tools in Machine Learning
Let’s start with a summary table of different tools, and we’ll go through each of them in more detail below.
Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments.
It gives you a central place to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. Individuals and organizations use Neptune for experiment tracking and model registry to have control over their experimentation and model development.
- Neptune allows you to display and explore the metadata of datasets. With Neptune's namespaces or basic logging, you can store and track dataset metadata such as the md5 hash of the dataset, its location, the list of classes, and the list of feature names (for structured data problems).
- Neptune allows you to version your models by logging model checkpoints.
- Neptune has a customizable UI that allows you to compare and query all your MLOps metadata.
To learn more about Neptune’s data lineage and versioning, check it out here.
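The dataset metadata mentioned above can be assembled as a simple dictionary before being logged. Here's a hedged sketch: the field names, dataset bytes, and path are illustrative, and the commented-out logging call stands in for Neptune's actual API rather than reproducing it.

```python
import hashlib

def build_dataset_metadata(data: bytes, location: str, classes, features):
    """Assemble the kind of dataset metadata you'd log to a metadata store."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "location": location,
        "classes": list(classes),
        "feature_names": list(features),
    }

meta = build_dataset_metadata(
    b"sepal_len,sepal_wid\n5.1,3.5\n",   # stand-in for the raw dataset
    location="s3://my-bucket/iris.csv",  # illustrative path
    classes=["setosa", "versicolor"],
    features=["sepal_len", "sepal_wid"],
)
# With Neptune, you would then log this under a namespace, e.g.:
# run["data/metadata"] = meta
```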
MLflow is an open-source platform for building, deploying, and managing machine learning workflows. It was designed to standardize and unify the machine learning process. It has four major components that help to organize ML experiments:
- MLflow Tracking: lets you log your machine learning model training sessions (called runs) and run queries using the Java, Python, R, and REST APIs.
- MLflow Models: the model component provides a standard unit for packaging and reusing machine learning models.
- MLflow Model Registry: the model registry component lets you centrally manage models and their lifecycle.
- MLflow Projects: the project component packages code used in data science projects so it can easily be reused and experiments can be reproduced.
- MLflow Model Registry provides a suite of APIs and an intuitive UI for organizations to register and share new versions of models, as well as perform lifecycle management on their existing models.
- MLflow Model Registry works with the MLflow Tracking component, which allows you to trace back to the original run where the model and data artifacts were generated, as well as the version of the source code for that run, giving a complete lineage of the lifecycle for all models and data transformations.
- MLflow Model Registry integrates with Delta Lake time travel, which:
- Provides a data store (data lake) for big data storage.
- Automatically versions data as it is written to a Delta table or directory.
- Lets you retrieve any version of your data using a version number or a timestamp.
- Lets you audit and roll back data in case of accidental bad writes or deletes.
- Lets you reproduce experiments and reports.
To learn more about MLflow, check out the MLflow docs.
Pachyderm is a data platform that mixes data lineage with end-to-end pipelines on Kubernetes. It brings version-controlled data pipeline layers to data science projects and ML experiments. It handles data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis.
Pachyderm is available in three versions:
- Community Edition: an open-source platform that you can use anywhere.
- Enterprise Edition: complete version-controlled platform with advanced features, unlimited scalability, and robust security.
- Hub Edition: a combination of the Community and Enterprise editions that takes the workload of managing Kubernetes off your hands.
- Pachyderm, like Git, helps you find data origins, then tracks and versions data as it's processed during model development.
- Pachyderm records the data's history from its origin to its present state.
- Pachyderm lets you quickly audit differences between data versions and roll back to earlier ones.
- Pachyderm provides dockerized MapReduce on big data.
- Pachyderm stores all the data in a central repository such as MinIO, AWS S3, or Google Cloud Storage, using its own specialized file system, the Pachyderm File System (PFS).
- Pachyderm updates all dependent data when a change is made to a dataset.
- It manages and records the transformations made to the data.
To learn more about Pachyderm, check out the Pachyderm docs.
Truedat is an open-source data governance tool that gives end-to-end visualizations of the data from origin to consumption.
- Truedat gives an end-to-end (origin-to-finish) understanding of the data from both a business and a technical point of view.
- Truedat lets you organize and enrich information through configurable workflows and monitor governance activity.
- With the data lake management option, you can request, deliver & use governed data.
- Truedat offers insight into the data’s journey with time.
- Truedat provides data governance.
- Truedat offers object lineage tracing, lineage object filtering, and point-in-time visibility on the data.
- Truedat provides user/client/target connection visibility on the data.
- Truedat gives a visual and text lineage view of the data journey.
- Truedat provides database change impact analysis.
- Truedat integrates with Amazon S3, Amazon RDS, Azure, BigQuery, Power BI, MySQL, PostgreSQL, Tableau, etc.
To learn more about Truedat, check out the Truedat docs.
CloverDX helps organize multiple data processes and improves and automates transparent data transformations. It brings together the design of transformations and workflows, including coding capabilities.
CloverDX provides data lineage for your dataset with a developer-friendly visual designer. It's good at automating data-related tasks such as data migration, and it does so fast.
- CloverDX provides transparency and balance to data workflows.
- CloverDX hosts other data quality tools.
- CloverDX does error tracking and resolution for your data.
- CloverDX offers reusable data operations.
- CloverDX enables self-sufficient data operations.
- CloverDX can be used standalone or embedded.
- CloverDX integrates with RDBMS, JMS, SOAP, LDAP, S3, HTTP, FTP, ZIP, and TAR.
SentryOne Document gives you powerful tools for ensuring your databases are continuously and accurately documented. Plus, the data lineage analysis capabilities help you ensure compliance by providing a visual representation of your data’s origin.
- Track data lineage with a visual display that clearly shows data dependencies across your environment.
- Document data sources including SQL Server, SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), Excel, Power BI, Azure Data Factory, and more.
- Easily manage documentation tasks and view logs with an easy-to-access cloud or software solution.
DVC is an open-source version control system for machine learning projects. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
It makes use of existing tools such as Git and various CI/CD apps, and its functionality can be grouped into several components.
- DVC offers all the advantages of a distributed version control system—lock-free, local branching, and versioning.
- DVC runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc).
- Data pipelines describe how models and other data artifacts are built, and provide an efficient way to reproduce them. Think “Makefiles for data and ML projects” done right.
- It can integrate with Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk to store data.
- DVC has a built-in way to connect ML steps into a DAG and run the full pipeline end-to-end.
- DVC handles caching of intermediate results and does not run a step again if input data or code are the same.
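As a sketch of how such a pipeline is declared, DVC reads stage definitions from a `dvc.yaml` file; the stage names, scripts, and paths below are illustrative, not from any real project:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

Running `dvc repro` walks this DAG and, thanks to the caching behavior described above, re-executes only the stages whose dependencies have changed.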
Spline (SPark LINEage) is a free, open-source tool for automated tracking of data lineage and data pipeline structure. It was originally designed for Spark, but the project has expanded to accommodate other data tools.
Spline has three major parts:
- Spline Server: acts as a central repository for lineage metadata received from the Spline agents via the Producer API, and stores it in ArangoDB.
- Spline Agents: listen for Spark activities, track and record the lineage information obtained from the various Apache Spark data pipelines, and send it in a standard format to the Spline server via its Producer API using REST or Kafka. The Spline agent is a Scala library and, depending on your use case, can be used standalone without the Spline server.
- Spline UI: a Docker image or WAR file that serves as the UI for Spline. The Spline Consumer API endpoint is accessed directly from the user's browser through the Spline UI.
- Spline tracks and records job dependencies, then creates an overview of how they interact and the transformations that occur at each step.
- Spline can be used within Azure Databricks.
- Spline handles big data processing well and is easy to use.
- Spline has a visualization interface that shows lineage information.
- Spline makes it easy to communicate lineage with the business team.
- Spline works well with structured data APIs, e.g., SQL, Datasets, and DataFrames.
To learn more about Spline, check out the Spline documentation.
As you can see, data lineage is very important if you want to do reproducible, high-quality work. Look through the tools listed in this article and see what fits your use case best. Thanks for reading!
Best 7 Data Version Control Tools That Improve Your Workflow With Machine Learning Projects
5 mins read | Jakub Czakon | Updated October 20th, 2021
Keeping track of all the data you use for models and experiments is not exactly a piece of cake. It takes a lot of time and is more than just managing and tracking files. You need to ensure everybody’s on the same page and follows changes simultaneously to keep track of the latest version.
You can do that with no effort by using the right software! A good data version control tool will allow you to have unified data sets with a strong repository of all your experiments.
It will also enable smooth collaboration between all team members so everyone can follow changes in real-time and always know what’s happening.
It’s a great way to systematize data version control, improve workflow, and minimize the risk of errors.
So check out these top tools for data version control that can help you automate work and optimize processes.
Data versioning tools are critical to your workflow if you care about reproducibility, traceability, and ML model lineage.
They help you get a version of an artifact, a hash of the dataset or model that you can use to identify and compare it later. Often you’d log this data version into your metadata management solution to make sure your model training is versioned and reproducible.
How to choose a data versioning tool?
To choose a suitable data versioning tool for your workflow, you should check:
- Support for your data modality: how does it support video/audio? Does it provide some preview for tabular data?
- Ease of use: how easy is it to use in your workflow? How much overhead does it add to your execution?
- Diff and compare: Can you compare datasets? Can you see the diff for your image directory?
- How well does it work with your stack: Can you easily connect to your infrastructure, platform, or model training workflow?
- Can you get your team on board: if your team doesn’t adopt it, it doesn’t matter how good the tool is, so keep your teammates’ skill sets and preferences in mind.
Here are a few tools worth exploring. Continue reading ->