Keeping track of all the data you use for models and experiments is not exactly a piece of cake. It takes a lot of time and is more than just managing and tracking files. You need to ensure everybody’s on the same page and follows changes simultaneously to keep track of the latest version.
You can do that with far less effort by using the right software. A good data version control tool gives you unified data sets and a reliable repository of all your experiments.
It will also enable smooth collaboration between all team members so everyone can follow changes in real time and always know what’s happening.
It’s a great way to systematize data version control, improve workflow, and minimize the risk of errors.
So check out these top tools for data version control that can help you automate work and optimize processes.
1. Neptune
Neptune is a lightweight experiment management and collaboration tool. It is very flexible, works with many other frameworks, and has a stable user interface so you can effectively systematize your ML experiments and improve management.
It’s robust software that can store, retrieve, and analyze a large amount of data. Neptune has all the tools for efficient team collaboration and data version control. Everything in one place!
Here’s Neptune in a nutshell:
- Provides user and organization management with different organizations, projects, and user roles to easily and quickly keep track of changes
- Intuitive UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
- Allows you to monitor resources and compare experiments interactively to collaborate more effectively
- You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
- Your team can track experiments which are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster)
- Rich experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)
2. Pachyderm
Pachyderm is a complete version-controlled data science platform that helps to control an end-to-end machine learning life cycle. It comes in three different versions: Community Edition (open-source, with the ability to be deployed anywhere), Enterprise Edition (complete version-controlled platform), and Hub Edition (a hosted version, still in beta).
It’s a great platform for flexible collaboration on any kind of machine learning project.
Here’s what you can do with Pachyderm as a data version tool:
- Pachyderm lets you continuously update data in the master branch of your repo, while experimenting with specific data commits in a separate branch or branches
- It supports any type, size, and number of files including binary and plain text files
- Pachyderm commits are centralized and transactional
- Provenance enables teams to build on each other’s work, share, transform, and update datasets while automatically maintaining a complete audit trail so that all results are reproducible
3. Delta Lake
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake – summary:
- Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
- ACID transactions: Serializable isolation levels ensure that readers never see inconsistent data.
- Data versioning: Enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Updates and deletes: Supports merge, update, and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
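As a rough illustration of the data versioning point above, the sketch below reads an older version of a Delta table through the Spark reader's `versionAsOf` option and lists the table's commits with `DESCRIBE HISTORY`. It assumes you already have a SparkSession (`spark`) configured with Delta Lake. The function names and the example path are hypothetical, not part of any library.

```python
def read_delta_version(spark, table_path, version):
    """Load one historical version of a Delta table as a DataFrame.

    `spark` is assumed to be an existing SparkSession set up for Delta Lake.
    """
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # time travel by version number
        .load(table_path)
    )


def table_history(spark, table_path):
    """Return the table's commit history as a DataFrame."""
    return spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`")


# Hypothetical usage, given a live session and a table at this path:
#   first_snapshot = read_delta_version(spark, "/data/events", 0)
#   table_history(spark, "/data/events").show()
```

Pinning an experiment to an explicit version number like this is one way to keep training data reproducible while the table keeps receiving new writes.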
4. Git LFS
Git Large File Storage (LFS) is an open-source project. It replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
It allows you to version large files (even those a couple of GB in size) with Git, host more in your Git repositories with external storage, and clone and fetch faster from repositories that deal with large files.
At the same time, you can keep your workflow and the same access controls and permissions for large files as the rest of your Git repository when working with a remote host like GitHub.
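To make the "text pointer" idea concrete, here is a small sketch that builds the pointer text Git LFS keeps in the repository in place of a large file: a spec version line, the SHA-256 of the content, and its size in bytes. The helper name `make_lfs_pointer` is made up for illustration. In practice the Git LFS client generates these pointers for you.

```python
import hashlib

LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Return the pointer text that stands in for `content` inside Git."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    fake_dataset = b"x" * 5_000_000  # stand-in for a large binary file
    pointer = make_lfs_pointer(fake_dataset)
    print(pointer)
    print(f"pointer is {len(pointer)} bytes for a {len(fake_dataset)}-byte file")
```

However large the original file is, the pointer stays around a hundred bytes, which is why clones and fetches stay fast.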
5. Dolt
Dolt is a SQL database that you can fork, clone, branch, merge, push, and pull just like a Git repository. Data and schema evolve together, which makes working with a version-controlled database a better experience. It’s a great tool to collaborate on with your team.
You can freely connect to Dolt just like to any MySQL database to run queries or update the data using SQL commands.
Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.
All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables.
There’s also DoltHub – a place to share Dolt databases.
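The command-line flow described above (import a CSV, commit, push) could be scripted roughly as follows. This is a sketch under assumptions: the function name, table name, file name, and primary-key column are hypothetical, and the exact flags are worth checking against `dolt table import --help` for your Dolt version. By default it only returns the commands, so you can review them before anything runs.

```python
import subprocess


def dolt_import_and_push(table, csv_path, pk, message,
                         remote="origin", branch="main", dry_run=True):
    """Build (and optionally run) the Dolt CLI calls for a CSV import."""
    commands = [
        # Create a new table from the CSV file, using `pk` as primary key.
        ["dolt", "table", "import", "-c", f"--pk={pk}", table, csv_path],
        ["dolt", "add", table],              # stage the new table
        ["dolt", "commit", "-m", message],   # record the change
        ["dolt", "push", remote, branch],    # share it with teammates
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    for cmd in dolt_import_and_push("employees", "employees.csv",
                                    pk="id", message="Import employees"):
        print(" ".join(cmd))
```

Passing `dry_run=False` runs the same commands inside the current Dolt repository directory.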
6. lakeFS
lakeFS is an open-source platform that provides a Git-like branching and committing model that scales to petabytes of data by utilizing S3 or GCS for storage.
This branching model makes your data lake ACID-compliant by allowing changes to happen in isolated branches that can be created, merged, and rolled back atomically and instantly.
lakeFS has three main areas that let you focus on different aspects of your ML models:
- Development Environment for Data: tools for isolating a snapshot of the lake that you can experiment with without exposing others to your changes, plus reproducibility for comparing changes and improving experiments
- Continuous Data Integration: ingesting and managing data according to your own rules
- Continuous Data Deployment: ability to quickly revert changes to data; providing consistency in your datasets; testing of production data to avoid cascading quality issues
lakeFS is a great tool for focusing on a specific area of your datasets to make ML experiments more consistent.
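One practical consequence of the branching model: when you go through lakeFS's S3-compatible endpoint, the repository takes the place of the bucket and the branch (or another ref) is the first path component. The tiny helper below sketches that addressing scheme. The function name, repository, branch, and object path are all hypothetical.

```python
def lakefs_s3_uri(repository: str, ref: str, object_path: str) -> str:
    """Build an S3-style URI for an object on a given lakeFS branch or ref."""
    return f"s3://{repository}/{ref}/{object_path.lstrip('/')}"


if __name__ == "__main__":
    # The same logical dataset, addressed on two branches.
    print(lakefs_s3_uri("ml-data", "main", "train/images.parquet"))
    print(lakefs_s3_uri("ml-data", "experiment-1", "train/images.parquet"))
```

Because only the ref changes, an experiment can switch between the production data and an isolated branch by changing a single string.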
7. DVC
DVC is an open-source version control system for machine learning projects. It’s a tool that lets you define your pipeline regardless of the language you use.
When you find a problem in a previous version of your ML model, DVC saves you time by leveraging code, data, and pipeline versioning to give you reproducibility. You can also train your model and share it with your teammates via DVC pipelines.
DVC can cope with versioning and organizing large amounts of data and store them in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionalities.
DVC – summary:
- Storage agnostic: works with different types of storage
- Full code and data provenance help to track the complete evolution of every ML model
- Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
- Tracking metrics
- A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
- Tracking failed attempts
- Runs on top of any Git repository and is compatible with any standard Git server or provider
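To show what "connecting ML steps into a DAG" means, here is a toy model in plain Python, not DVC's actual implementation. Each stage lists the files it depends on and the files it produces (mirroring the `deps` and `outs` keys of a DVC pipeline file), and a topological sort gives an order in which the stages can run end-to-end. The stage and file names are invented for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-stage pipeline.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"],
                 "outs": ["metrics.json"]},
}


def execution_order(stages):
    """Order stages so each one runs after the stages producing its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on whichever stages produce its input files;
    # inputs nobody produces (raw data) add no edge.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(stages))
```

Linking stages through files rather than explicit stage names is what lets a tool work out which downstream steps are affected when one input changes.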
To wrap it up
Data versioning doesn’t have to be challenging. You can streamline it and minimize errors by using the right tool. Combined with best practices, it’ll help you achieve the best results and optimize processes.
So take your machine learning experiments to the next level and use a data version control tool!
Get started with Neptune in 5 minutes
If you are looking for an experiment tracking tool you may want to take a look at Neptune.
It takes literally 5 minutes to set up and as one of our happy users said:
“Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison.” – Edward Dixon, Data Scientist @intel
To get started, follow these 4 simple steps.
Install the client library.
pip install neptune-client
Connect to the tool by adding a snippet to your training code.
import neptune

neptune.init(...)  # credentials
neptune.create_experiment()  # start logger
Specify what you want to log:
neptune.log_metric('accuracy', 0.92)
for prediction_image in worst_predictions:
    neptune.log_image('worst predictions', prediction_image)
Run your experiment as you normally would.
And that’s it!
Your experiment is logged to a central experiment database and displayed in the experiment dashboard, where you can search, compare, and drill down to whatever information you need.