Keeping track of all the data you use for models and experiments is not exactly a piece of cake. It takes a lot of time and is more than just managing and tracking files. You need to ensure everybody’s on the same page and follows changes simultaneously to keep track of the latest version.
You can do that with no effort by using the right software! A good data version control tool will allow you to have unified data sets with a strong repository of all your experiments.
It will also enable smooth collaboration between all team members so everyone can follow changes in real-time and always know what’s happening.
It’s a great way to systematize data version control, improve workflow, and minimize the risk of occurring errors.
So check out these top tools for data version control that can help you automate work and optimize processes.
Neptune is a metadata store that was built for research and production teams that run many experiments. It is very flexible, works with many other frameworks, and has a stable user interface so you can effectively systematize your ML experiments and improve management.
At the core Neptune composed of three major components:
- Data versioning
- Experiment tracking
- Model registry
These components allow Neptune to serve as a connector between different parts of the MLOps workflow. The main purpose is to create a centralized place for all machine life-cycle metadata and make it easy for teams to store, organize, display, track the lineage, share and compare all metadata generated during model development.
It is a robust software that can store, retrieve, and analyze a large amount of data as well as facilitate efficient team collaboration and project supervision.
Neptune – summary:
- Provides user and organization management with different organization, project, and user roles
- Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
- You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
- Your team can track experiments that are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker), and do that on any infrastructure (cloud, laptop, cluster)
- Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)
- Provides individuals and teams with notebook checkpointing and model registry to track model version and lineage.
Pachyderm is a complete version-controlled data science platform that helps to control an end-to-end machine learning life cycle. It comes in three different versions, Community Edition (open-source, with the ability to be deployed anywhere), Enterprise Edition (complete version-controlled platform), and Hub Edition (a hosted version, still in beta).
It’s a great platform for flexible collaboration on any kind of machine learning project.
Here’s what you can do with Pachyderm as a data version tool:
- Pachyderm lets you continuously update data in the master branch of your repo, while experimenting with specific data commits in a separate branch or branches
- It supports any type, size, and number of files including binary and plain text files
- Pachyderm commits are centralized and transactional
- Provenance enables teams to build on each other work, share, transform, and update datasets while automatically maintaining a complete audit trail so that all results are reproducible
The Best Pachyderm Alternatives
3. Delta Lake
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake – summary:
- Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
- Serializable isolation levels ensure that readers never see inconsistent data.
- Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments
- Supports merge, update, and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
4. Git LFS
Git Large File Storage (LFS) is an open-source project. It replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
It allows you to version large files—even those as large as a couple GB in size—with Git, host more in your Git repositories with external storage, and to faster clone and fetch from repositories that deal with large files.
At the same time, you can keep your workflow and the same access controls and permissions for large files as the rest of your Git repository when working with a remote host like GitHub.
Dolt is a SQL database that you can fork, clone, branch, merge, push, and pull just like a git repository. Dolt allows data and schema to evolve together to make a version control database a better experience. It’s a great tool to collaborate on with your team.
You can freely connect to Dolt just like to any MySQL database to run queries or update the data using SQL commands.
Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.
All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables.
There’s also DoltHub – a place to share Dolt databases.
lakeFS is an open-source platform that provides a Git-like branching and committing model that scales to Petabytes of data by utilizing S3 or GCS for storage.
This branching model makes your data lake ACID-compliant by allowing changes to happen in isolated branches that can be created, merged, and rolled back atomically and instantly.
lakeFS has three main areas that let you focus on differen aspect of your ML models:
- Development Environment for Data: has tools that you can use to isolate snapshot of the lake you can experiment with while others are not exposed; reproducibility to compare changes and improve experiments
- Continuous Data Integration: entering and managing data according to your own rules
- Continuous Data Deployment: ability to quickly revert changes to data; providing consistency in your datasets; testing of production data to avoid cascading quality issues
lakeFS is a great tool for focusing on a specific area of your datasets to make ML experiments more consistent.
DVC is an open-source version control system for machine learning projects. It’s a tool that lets you define your pipeline regardless of the language you use.
When you find a problem in a previous version of your ML model, DVC saves your time by leveraging code data, and pipeline versioning, to give you reproducibility. You can also train your model and share it with your teammates via DVC pipelines.
DVC can cope with versioning and organization of big amounts of data and store them in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionalities.
DVC – summary:
- Possibility to use different types of storage— it’s storage agnostic
- Full code and data provenance help to track the complete evolution of every ML model
- Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
- Tracking metrics
- A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
- Tracking failed attempts
- Runs on top of any Git repository and is compatible with any standard Git server or provider
DVC vs Neptune comparison
To wrap it up
Data versioning doesn’t have to be challenging. You can streamline it and minimize error occurrence by using the right tool. And with the best practices, it’ll help you achieve the best results and optimize processes.
So take your machine learning experiments to the next level and use a data version control tools!