Version control is used to track and manage changes in your project: code, key parameters, documentation, and more. Version Control Systems (VCS) come in handy when multiple people are working on the same code. Without a VCS, a software project would be chaos.
Version control is also called source control or revision control. Why? You'll understand as we explore the topic below.
What is Version Control?
Version Control is a way of tracking changes made to a codebase. In ML, the Version Control System enables smooth collaboration between developers and data scientists and allows you to restore your codebase to any previous version in case of problems.
The VCS helps teams collaborate on the same project without having too many complications. Developers and data scientists can view, manage, and merge changes wherever needed. New team members can adapt to the current environment easier.
When working on machine learning models, it's always good to track training datasets, hyperparameters, algorithm choices, architecture, and the pipelines used to create the model.
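As a small illustration (the file name and parameter values below are hypothetical), hyperparameters can be written to a plain JSON file that lives next to the code, so the VCS versions them together with it:

```python
# Sketch: persist hyperparameters in a JSON file so the VCS tracks them
# alongside the code. File name and values here are made up for illustration.
import json

params = {
    "learning_rate": 0.01,
    "n_estimators": 200,
    "train_split": 0.8,
}

with open("params.json", "w") as f:
    # sort_keys gives stable, diff-friendly output across commits
    json.dump(params, f, indent=2, sort_keys=True)
```

Because the file is plain text with a stable key order, every change to a hyperparameter shows up as a clean one-line diff in the commit history.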
Types of Version Control systems
We’ll discuss Centralized and Distributed systems and compare them in detail. You can also have a local version control system, with all commits saved on a local machine, but it’s highly error-prone, so we won’t get into it.
1. Centralized Version Control
In centralized version control, you have a single repository where everyone commits and updates the code at the same time. When someone updates the code, everyone in the team will know and will have to update the code in their workspace. Every change made goes to a central server (repository). Your change directly influences the master code, which will be in production. There’s no approval or merging system, it’s a simple commit and update.
- Centralized systems are easy to understand and manage.
- You get access to master-level code, no need to get approval from top authorities.
- It performs seamlessly with binary files.
2. Distributed Version Control
In distributed version control, every developer gets their own repository and copy of the program. When you update the code, only you have access to the change until you push it to the central repository or master code. In simple terms: you commit and push your changes, and other team members pull and update. You don't have to rely on the central repository all the time; you can clone the entire codebase to your local machine and work offline. Distributed version control usually still has a central repository, but it's authoritative only by convention, since every clone contains the full history.
- Better performance and governance,
- Branching and merging is easier than in centralized version control,
- You can work offline.
Centralized and distributed Version Control differences

| Centralized (CVCS) | Distributed (DVCS) |
| --- | --- |
| Easy to use as a beginner | More complicated for beginners |
| Can't work offline | Works offline on your local machine |
| Slower and more time-consuming, since every operation requires direct server communication | Faster, because you don't have to communicate with the central repository for every command |
| Saves a good amount of local space, since you don't keep the entire history or source code on your machine | Takes more time and space on your local machine if the project contains large files |
| If the server goes down, there's no way to get the code back | If the server goes down, you have plenty of machines to get the code from, including your own |
Why use Version Control
VCS lets you merge, change, and view all previous changes made to your program. It’s a good way to monitor program builds and things in development and production. For ML and AI-related work, it’s important to use VCS for non-binary files so that data scientists and ML developers can work in a central hub, without affecting other peoples’ work. Versioning is an important part of governance in the machine learning and AI world.
VCS makes it easy for developers to share, edit, view, and merge code from different places at different times. Without VCS, things get very complex and code can get buggy quickly. VCS becomes a common hub or central place where everyone on board can monitor what’s going on with the program.
Say you made necessary changes to the program, and now you have to merge them into the production code. It's time to properly store those changes, saving the files with proper version names (e.g. profile-redesign-v102). If you use long and complicated version names, you'll get lost in the future. It's best to keep names simple and alphanumeric, with some semantic information, so versions can be tracked easily whenever needed. Documentation of updates is also very important in the software world; without proper documentation or README files, developers lose track of progress. A VCS keeps records of all changes made from day one, so you can request any version you want and compare it with the current one.
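A simple, sortable naming scheme can even be checked in code. The sketch below (the tag format and names are hypothetical) extracts the numeric part of tags like profile-redesign-v102, so the latest version is easy to find programmatically:

```python
# Sketch: parse simple alphanumeric version tags so they stay sortable.
# Tag names and format are hypothetical examples.
import re

def version_number(tag):
    """Pull the trailing numeric version out of a tag like 'profile-redesign-v102'."""
    match = re.search(r"v(\d+)$", tag)
    return int(match.group(1)) if match else -1

tags = ["profile-redesign-v99", "profile-redesign-v102", "profile-redesign-v100"]
latest = max(tags, key=version_number)
print(latest)  # → profile-redesign-v102
```

Note that sorting tags as plain strings would wrongly put v99 after v102, which is exactly why the numeric part is parsed out first.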
Review & backup
Every time you push or merge the code, you have to add short information on what changes were made. This will help everyone working on the program know that this particular file has changed and for this reason. A version control system is the best way to back up your files and code. You can get the complete updated code whenever you want without having issues.
While pushing a new version of a machine learning model, you may discover that it has an issue or fails at a particular point. Having a complete record of your code lets you restore the previous version, and gives you enough time to work on the issue while the production code remains bug-free.
When you work with a standard VCS, you only keep track of code files, documentation, and file dependencies. Machine learning is different and more complex: you also have to keep track of datasets, models, and parameters, and keep an eye on which data was used for training and which for testing. Most of the time, multiple languages and framework tools are used in ML projects, so tracking dependencies becomes a critical task. Developers take time to roll models out to production until they're confident in the performance; a VCS helps them roll out versions at the right time.
Versioning is an important part of machine learning and AI governance. It helps companies understand workflows, grant access rights, execute strategies, track and control all machine learning work and their results. If properly managed, the organization can collect any information that may affect the performance of the model.
With time, the data pipeline changes. You have to keep training your model, and this is where versioning comes in scope. You have to keep track of all metadata and results related to your ML workflow. Metadata will help you understand the nature of specific values in your dataset. It’s very important to take all the data into the model in the proper format, so the metadata will have to be properly versioned.
With VCS, you can connect data values to the respective metadata of a specific version. This helps developers find gaps in the dataset and values.
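One common way to connect a dataset to the metadata of a specific version is to derive a version id from the file contents and record the metadata under that id. This is a minimal, tool-agnostic sketch (the function names and registry structure are illustrative, not any specific library's API):

```python
# Sketch: derive a dataset version id from file contents and attach metadata
# to it. Names and the in-memory registry are illustrative assumptions.
import hashlib
import json

def dataset_version(path):
    """Return a short content hash identifying this exact dataset version."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream large files
            digest.update(chunk)
    return digest.hexdigest()[:12]

registry = {}  # version id -> metadata

def register(path, metadata):
    """Record metadata under the dataset's content-derived version id."""
    version = dataset_version(path)
    registry[version] = metadata
    return version
```

If the file changes in any way, the id changes with it, so every metadata record points at exactly one dataset state, which makes gaps between data and metadata easy to spot.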
All ML project leaders want to improve the accuracy of their ML model(s). If it turns out that a previous version was more accurate, VCS lets you go back to that version quickly.
Continuous Integration (CI) is a method of merging feature branches into the master code, where they can be built and tested automatically. This process helps developers locate issues and quickly find solutions. Fail fast, improve faster: that's what CI helps you do, and it improves the quality of your workflow.
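As a hedged sketch of what this looks like in practice (the repository layout, Python version, and test command are assumptions, not prescriptions), a minimal GitHub Actions workflow that builds and tests every push could look like this:

```yaml
# .github/workflows/ci.yml — minimal sketch; adjust names and steps to your project
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - run: pip install -r requirements.txt
      - run: pytest  # fail fast: the build goes red as soon as a test breaks
```

Every feature branch gets the same automated build and test run before it's merged, which is where the "fail fast" feedback comes from.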
How to do Version Control
Git

Git is the most popular VCS; most developers around the world use it. It tracks all the changes made to code files so that you have a record, and you can easily return to a specific version if something goes wrong with the current code. Git also has an easy-to-manage collaboration system.
Benefits of Git:
- Performance – it’s an easy to maintain and reliable version control system. You can view and compare changes made to the code over a specific time, and merge them.
- Flexibility – Git supports many kinds of development methods and workflows, it’s designed to handle small and large projects. It supports branching and tagging operations to store every change made to the project.
- Wide Acceptance – Git is universally accepted for its usability and performance standards.
Sandbox

A sandbox is an environment where teams can build, test, and deploy programs. In a sandbox environment, code is split into smaller groups or cells; Jupyter Notebook is a popular tool for ML sandboxing, and sometimes a sandbox environment is the best way to test out IT solutions. A sandbox helps you with project integration, user demos and testing, and quality assurance, and it has features for every stage of the machine learning lifecycle, not just the testing phase.
Benefits of Sandbox Environment:
- Easy to use workflow – develop, test, and deploy versions seamlessly. You can just edit a few lines of code and test them without pushing whole files/projects.
- Collaboration – sharing and keeping everyone updated about the program is critical. You can get feedback and share updates with only a few permissions, and work with new technologies without complications in your architecture.
- Cost-Effective – sandbox environments are a pay-as-you-go service, so maintaining your workflow becomes easy and saves a lot of money for your company.
Data Version Control
Data Version Control (DVC) is an open-source tool for machine learning and data science projects, and a strong way to version your AI project. It's similar to Git, but can also track pipeline steps, dependencies, data files, and a lot more, and it helps you store large datasets, models, and files.
Benefits of DVC:
- Language-Independent – ML processes can seamlessly transform into reproducible pipelines for any language.
- Easy Installation – DVC is an open-source tool you can easily install with simple commands:
pip install dvc
- Data Sharing – you can share your ML data files on any cloud platform like Amazon Web Services or Google Cloud Platform.
- Data compliance – monitor data modification efforts as Git Pull Requests. Inspect the project’s well-established and unchangeable records to learn when datasets or models were approved, and why.
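To make the idea concrete, here is an illustrative Python sketch of what a DVC-style pointer file records: a content hash of the large data file, which Git then versions instead of the data itself. This mimics the concept only; it is not DVC's actual implementation.

```python
# Illustrative only: a tiny stand-in for a DVC-style '.dvc' pointer file.
# The stub records a hash of the big data file; Git versions the stub,
# while the data itself lives in bulk storage.
import hashlib
import os

def write_pointer_file(data_path):
    """Write a small '<file>.dvc'-style stub holding the data file's hash."""
    with open(data_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    stub_path = data_path + ".dvc"
    with open(stub_path, "w") as f:
        f.write(f"outs:\n- md5: {md5}\n  path: {os.path.basename(data_path)}\n")
    return stub_path
```

The large file goes to remote storage (S3, GCS, and so on), while the tiny stub is committed to Git, so checking out an old commit tells you exactly which data version belongs with it.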
Versioning with Neptune
Neptune is a metadata store for MLOps, developed for research and production teams. It gives you a central hub to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. Researchers and engineers use Neptune for experiment tracking and model registry, to keep control over their experimentation and model development.
Let’s start by initializing Neptune:
import neptune.new as neptune

run = neptune.init('USER_NAME/PROJECT_NAME', api_token='ANONYMOUS')
File Data Version:
from neptunecontrib.versioning.data import log_data_version

FILEPATH = '/path/to/data/my_data.csv'
with neptune.create_experiment():
    log_data_version(FILEPATH)
Folder Data Version:
from neptunecontrib.versioning.data import log_data_version

DIRPATH = '/path/to/data/folder'
with neptune.create_experiment():
    log_data_version(DIRPATH)
S3 bucket data version:
(You can log the version and specific key, similar to file versioning)
from neptunecontrib.versioning.data import log_s3_data_version

BUCKET = 'my-bucket'
PATH = 'training_dataset.csv'
with neptune.create_experiment():
    log_s3_data_version(BUCKET, PATH)
(You can track multiple data sources; just make sure you use the prefix argument when logging)
from neptunecontrib.versioning.data import log_data_version

FILEPATH_TABLE_1 = '/path/to/data/my_table_1.csv'
FILEPATH_TABLE_2 = '/path/to/data/my_table_2.csv'
with neptune.create_experiment():
    log_data_version(FILEPATH_TABLE_1, prefix='table_1_')
    log_data_version(FILEPATH_TABLE_2, prefix='table_2_')
Logging Image directory snapshots with subfolders:
(With log_image_dir_snapshots you can log visual snapshots of your image directories)
from neptunecontrib.versioning.data import log_image_dir_snapshots

PATH = '/path/to/data/my_image_dir'
with neptune.create_experiment():
    log_image_dir_snapshots(PATH)
Version Control best practices
Use a good commit message
A good commit message will help other developers and team members understand what you updated. Adding a short and informative description will make things much easier to review.
Make each commit matter
Make sure that your commit serves a purpose and isn't just a backup of your work: add a new feature, or fix an issue. Try not to make multiple unrelated changes in a single commit, as that makes it difficult to review.
In this process, you may find some bugs and want to fix them too. If you want to commit specific files one step at a time, here are the commands for different VCS:

- Git: git commit filename1 filename2 commits both named files.
- SVN: svn commit filename1 filename2 commits both named files.
If you want to save a file that serves multiple changes (for example multiple features), then you have to introduce them in logical chunks and make commits as you go. Each VCS supports this operation:
- Git: store filename in a safe temporary location (git stash is one option), then run git checkout filename to restore filename to its unchanged state.
- SVN: move filename to a temporary place, then run svn update filename to restore filename to its unchanged state.
Don’t break builds or do force commits
Try to avoid breaking builds by performing full commits. You can prepare test cases for your commits and new APIs, which helps other team members use those files without breaking the build. Say you're working on some API: it may work well on your machine, but chances are it will break on a different one. Also, if the VCS isn't accepting your push or is displaying an error message, it's best not to force the commit through; that might break the whole workflow and mess things up.
Review and branching
Whenever you commit and push your code to a central repository, make sure you do a good review. This will help you understand the commit, and find issues or gaps in the commit.
Branching helps improve the workflow in many cases. It has some cons, but if you use good branching practices, you can make things easy for managing code, releases, milestones, issues, new features, and more.
Make traceable commits
For security and monitoring purposes, you need to store information about the commits, like reviewer remarks, information about commits, author details, and more. This will ensure that if needed, any commit can be backed out. Changes can be re-applied or updated after analyzing the issue.
Version Control Systems are the best way to monitor and track changes of all code files, datasets, dependencies, documentation, and other key assets of your project.
With a VCS, every team member is on the same page, collaboration is smoother, workflows are better, there are fewer errors, and development gradually speeds up.
Additional research and recommended reading
Best 7 Data Version Control Tools That Improve Your Workflow With Machine Learning Projects
5 mins read | Jakub Czakon | Updated October 20th, 2021
Keeping track of all the data you use for models and experiments is not exactly a piece of cake. It takes a lot of time and is more than just managing and tracking files. You need to ensure everybody’s on the same page and follows changes simultaneously to keep track of the latest version.
You can do that with no effort by using the right software! A good data version control tool will allow you to have unified data sets with a strong repository of all your experiments.
It will also enable smooth collaboration between all team members so everyone can follow changes in real-time and always know what’s happening.
It’s a great way to systematize data version control, improve workflow, and minimize the risk of occurring errors.
So check out these top tools for data version control that can help you automate work and optimize processes.
Data versioning tools are critical to your workflow if you care about reproducibility, traceability, and ML model lineage.
They help you get a version of an artifact, a hash of the dataset or model that you can use to identify and compare it later. Often you’d log this data version into your metadata management solution to make sure your model training is versioned and reproducible.
How to choose a data versioning tool?
To choose a suitable data versioning tool for your workflow, you should check:
- Support for your data modality: how does it support video/audio? Does it provide some preview for tabular data?
- Ease of use: how easy is it to use in your workflow? How much overhead does it add to your execution?
- Diff and compare: Can you compare datasets? Can you see the diff for your image directory?
- How well does it work with your stack: Can you easily connect to your infrastructure, platform, or model training workflow?
- Can you get your team on board: If your team does not adopt it, it doesn't matter how good the tool is. So keep your teammates' skillsets and preferences in mind.
Here are a few tools worth exploring.