MLOps Blog

Version Control Guide For Machine Learning Researchers

Harshil Patel
25th April, 2023

Version control is used to track and manage changes in your project, including key parameters, documentation, application code, and a lot more. Version Control Systems (VCS) come in handy when multiple people are working on the same code. Without a VCS, a software project would be chaos.

Version control is also called source control or revision control.

What is Version Control?

Version Control is a way of tracking changes made to a codebase. In ML, the Version Control System enables smooth collaboration between developers and data scientists and allows you to restore your codebase to any previous version in case of problems.

The VCS helps teams collaborate on the same project without too many complications. Developers and data scientists can view, manage, and merge changes wherever needed. New team members can adapt to the current environment more easily.

When working on machine learning models, it’s good practice to track the training datasets, hyperparameters, algorithm choices, architecture, and pipelines used to create the model.
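As a rough illustration of what tracking these items can look like, here is a minimal Python sketch that writes a small JSON manifest for one training run. The function names, the manifest layout, and the choice of SHA-256 are assumptions made for this example, not part of any particular tool.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset: Path, hyperparams: dict, algorithm: str) -> dict:
    """Collect the facts needed to identify one training run (hypothetical layout)."""
    return {
        "dataset": {"path": dataset.name, "sha256": file_sha256(dataset)},
        "hyperparameters": hyperparams,
        "algorithm": algorithm,
    }


def save_manifest(manifest: dict, target: Path) -> None:
    """Write the manifest as sorted, indented JSON so text diffs stay readable."""
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Committing a file like this next to the training code ties each model to the exact data and settings that produced it.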

Version control tracking
Source: Author

Types of Version Control systems

We’ll discuss Centralized and Distributed systems and compare them in detail. You can also have a local version control system, with all commits saved on a local machine, but it’s highly error-prone, so we won’t get into it. 

1. Centralized Version Control

In centralized version control, you have a single repository that everyone commits to and updates from. When someone updates the code, everyone on the team will know and will have to update the code in their workspace. Every change goes to a central server (repository) and directly affects the master code, which will be in production. There’s no approval or merging system; it’s a simple commit and update.

Centralized version control
Source: Author

Advantages of centralized version control:

  • Centralized systems are easy to understand and manage.
  • You work directly on the master code, with no approval step in between.
  • It handles binary files well.

The most widely used centralized version control systems are SVN and Perforce.

2. Distributed Version Control

In distributed version control, every developer gets their own repository and working copy of the program. When you update the code, only you have access to the change until you push it to the central repository or master code. In simple terms, you commit and push the changes, and other team members pull and update. You don’t have to rely on the central repository all the time: you can clone the entire codebase to your local machine and work offline. Distributed version control usually has a central repository too, but it’s central only by convention, because every clone holds the full history.

Distributed version control
Source: Author

Advantages of distributed version control:

  • Better performance and governance.
  • Branching and merging are easier than in centralized version control.
  • You can work offline.

The most popular distributed version control systems are Git and Mercurial.

Centralized and distributed Version Control differences

| Centralized Version Control | Distributed Version Control |
|---|---|
| Easy to use as a beginner | Complicated for beginners |
| Can’t work offline | Work offline on your local machine |
| Slower and more time-consuming, because commands require direct server communication | Faster, because you don’t have to communicate with the repository for every command |
| You don’t need to save the entire history or source code to your local machine, so you save a good amount of space | If the project contains large files, DVCS will take more time and space on your local machine |
| If the server goes down and there is no backup, there’s no way to get back the code | If the server goes down you have plenty of machines to get code from, including your local machine |

Why use Version Control

VCS lets you view, merge, and revert the changes made to your program. It’s a good way to monitor program builds and things in development and production. For ML and AI-related work, it’s important to use VCS for non-binary files so that data scientists and ML developers can work in a central hub without affecting other people’s work. Versioning is an important part of governance in the machine learning and AI world.

Collaboration

VCS makes it easy for developers to share, edit, view, and merge code from different places at different times. Without VCS, things get very complex and code can get buggy quickly. VCS becomes a common hub or central place where everyone on board can monitor what’s going on with the program. 

Storing versions

Say you’ve made the necessary changes to the program, and now you have to merge them into production code. It’s time to store those changes properly, saving the files with clear version names (e.g., profile-redesign-v102). Long and complicated version names will get you lost later. It’s best to keep names simple and alphanumeric, with some semantic information, so they can be tracked easily whenever needed. Documentation of updates is very important in the software world. Without proper documentation or README files, developers lose track of progress. A VCS keeps a record of all changes made from day one, so you can request any version you want and compare it with the current version.
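To make the naming advice concrete, here is a small Python sketch that validates names in the style of profile-redesign-v102 and derives the next one. The exact pattern (lowercase words joined by hyphens, followed by -v and a number) is an assumption for illustration; adapt it to whatever convention your team agrees on.

```python
import re

# Hypothetical convention: lowercase alphanumeric words joined by hyphens,
# ending in "-v<number>", e.g. "profile-redesign-v102".
VERSION_NAME = re.compile(r"^(?P<label>[a-z0-9]+(?:-[a-z0-9]+)*)-v(?P<number>\d+)$")


def parse_version_name(name: str) -> tuple[str, int]:
    """Split a version name into its semantic label and numeric version."""
    match = VERSION_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid version name: {name!r}")
    return match["label"], int(match["number"])


def next_version_name(name: str) -> str:
    """Return the name that follows the given one in the same series."""
    label, number = parse_version_name(name)
    return f"{label}-v{number + 1}"
```

A check like this can run before a release is tagged, so that names stay sortable and searchable.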

Review & backup 

Every time you push or merge code, you have to add a short description of what changed. This tells everyone working on the program which files changed and why. A version control system is also a good way to back up your files and code: you can get the complete, up-to-date code whenever you want.

Faster improvements

When pushing a new version of a machine learning model, you may discover that it has an issue or fails at a particular point. Having the full record of your code lets you restore the previous version, which gives you time to work on the issue while the production code stays bug-free.

Complexity 

With a standard VCS, you only keep track of code files, documentation, and file dependencies. Machine learning is more complex: you also have to keep track of your datasets, models, and parameters, and of which data was used for training and which for testing. ML projects often use multiple languages and frameworks, so keeping track of dependencies becomes a critical task. Developers hold back models from production until they’re confident in the performance, and a VCS helps them roll out versions at the right time.
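One lightweight way to keep dependencies traceable on the Python side is to record the installed package versions alongside each model version. The sketch below uses the standard library’s importlib.metadata; the function names and the snapshot format are assumptions for this example.

```python
from importlib import metadata


def snapshot_dependencies(packages: list[str]) -> dict:
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_snapshots(old: dict, new: dict) -> dict:
    """Return {package: (old_version, new_version)} for every version that changed."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Comparing the snapshot stored with an older model against the current environment shows which library upgrades happened in between.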

Governance

Versioning is an important part of machine learning and AI governance. It helps companies understand workflows, grant access rights, execute strategies, and track and control all machine learning work and its results. If properly managed, the organization can collect any information that may affect the performance of the model.

Data

With time, the data pipeline changes. You have to keep training your model, and this is where versioning comes in scope. You have to keep track of all metadata and results related to your ML workflow. Metadata will help you understand the nature of specific values in your dataset. It’s very important to take all the data into the model in the proper format, so the metadata will have to be properly versioned.

With VCS, you can connect data values to the respective metadata of a specific version. This helps developers find gaps in the dataset and values.
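As a sketch of how metadata can be tied to a specific data version, the Python function below summarises a CSV dataset and derives a version id from its content. The metadata fields and the 12-character truncated hash are assumptions chosen for illustration.

```python
import csv
import hashlib
import io


def describe_dataset(csv_text: str) -> dict:
    """Summarise a CSV dataset and tie the summary to a content-derived version id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    # Count empty cells per column to expose gaps in the data.
    missing = {
        column: sum(1 for row in rows if row[column] in ("", None))
        for column in columns
    }
    # Any change to the content produces a different version id.
    version = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:12]
    return {
        "version": version,
        "columns": columns,
        "row_count": len(rows),
        "missing_values": missing,
    }
```

Storing this summary with each dataset version makes it easy to see when columns or missing-value counts shift between versions.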

Model

All ML project leaders want to improve the accuracy of their ML model(s). If it turns out that a previous version was more accurate, VCS lets you go back to that version quickly.
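The decision to go back can be sketched in a few lines of Python, assuming you record an accuracy score for every model version. The ModelVersion class and the tags are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    tag: str          # hypothetical version label, e.g. a Git tag
    accuracy: float   # validation accuracy recorded for that version


def best_version(history: list[ModelVersion]) -> ModelVersion:
    """Return the recorded version with the highest accuracy."""
    if not history:
        raise ValueError("no versions recorded")
    return max(history, key=lambda version: version.accuracy)


def should_roll_back(history: list[ModelVersion]) -> bool:
    """True when the latest version is less accurate than an earlier one."""
    return history[-1].accuracy < best_version(history).accuracy
```

For example, with versions scoring 0.91, 0.94, and 0.92 in that order, the second version is the one to restore.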

Continuous Integration 

Continuous Integration (CI) is the practice of frequently merging feature branches into the master code, with each merge built and tested automatically. This helps developers locate issues early and fix them quickly. Fail fast, improve faster: this is what CI helps you do, and it improves the quality of your workflow.
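The automated testing step can be pictured as a script that a CI job runs on every merge. The following Python sketch is a simplified stand-in for a real CI setup: the check names are hypothetical, and an actual pipeline would typically call a test runner and be configured in your CI service.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], None]]) -> list[str]:
    """Run every named check and return the names of those that raised."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as error:  # a failing check raises, e.g. AssertionError
            print(f"FAIL {name}: {error}")
            failures.append(name)
        else:
            print(f"PASS {name}")
    return failures


def exit_code(failures: list[str]) -> int:
    """CI services treat a non-zero exit code as a failed build."""
    return 1 if failures else 0
```

A script would end with something like sys.exit(exit_code(run_checks(...))) so that a failing check blocks the merge.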

See – Best 7 Data Version Control Tools That Improve Your Workflow with Machine Learning Projects

How to do Version Control 

Git

Git is the most popular VCS, used by most developers around the world. It tracks all the changes made to code files so that you have a record, and you can easily return to a specific version if something goes wrong with the current code. Git also has an easy-to-manage collaboration system.

Version control - collaboration
Source: Author

Benefits of Git:

  • Performance – it’s a reliable, easy-to-maintain version control system. You can view and compare changes made to the code over a specific time, and merge them.
  • Flexibility – Git supports many kinds of development methods and workflows, and it’s designed to handle both small and large projects. It supports branching and tagging operations to organize and mark the changes made to the project.
  • Wide Acceptance – Git is widely adopted for its usability and performance.
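Returning to a specific version of a file can be sketched from Python by calling the git command line. This assumes git is installed and that repo points at an existing repository; the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def read_file_at(repo: Path, revision: str, filename: str) -> str:
    """Return filename (path relative to the repo root) as recorded at revision."""
    return git(repo, "show", f"{revision}:{filename}")


def restore_file(repo: Path, revision: str, filename: str) -> None:
    """Overwrite the working copy of filename with its content at revision."""
    git(repo, "checkout", revision, "--", filename)
```

read_file_at only inspects history, while restore_file changes the working copy, so the first is the safer way to compare before restoring.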

Sandbox

A sandbox is an environment where teams can build, test, and deploy programs. In a notebook-style sandbox, code is split into smaller groups or cells. Jupyter Notebook is a popular tool for ML sandboxing. Sometimes a sandbox environment is the best way to test out IT solutions. A sandbox helps with project integration, user demos and testing, and quality assurance. It can support every stage of your machine learning lifecycle, not just the testing phase.

Benefits of Sandbox Environment:

  • Easy to use workflow – develop, test, and deploy versions seamlessly. You can just edit a few lines of code and test them without pushing whole files/projects.
  • Collaboration – sharing, and keeping everyone updated about the program is critical. You can get feedback and updates on the program with few permissions. You can work on reputable technologies without having any complications with architecture. 
  • Cost-Effective – hosted sandbox environments are often pay-as-you-go services, so maintaining your workflow becomes easy and can save your company a lot of money.

Data Version Control

Data Version Control (DVC) is an open-source tool for versioning machine learning and data science projects. It works alongside Git and adds the ability to track pipeline steps, dependencies, data files, and a lot more. It helps you store and version large datasets, models, and files.

Benefits of DVC:

  • Language-Independent – ML processes can seamlessly transform into reproducible pipelines for any language.
  • Easy Installation – DVC is an open-source tool you can easily install with simple commands:
pip install dvc
  • Data Sharing – you can share your ML data files on any cloud platform like Amazon Web Services or Google Cloud Platform. 
  • Data compliance – monitor data modification efforts as Git Pull Requests. Inspect the project’s well-established and unchangeable records to learn when datasets or models were approved, and why.
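The general idea behind versioning large files next to Git can be illustrated with a toy content-addressed cache: the large file is stored under its hash, and only a small pointer file is committed. The Python sketch below is a conceptual illustration only. It is not DVC’s actual file format or internals, and the .pointer.json name is invented for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy a file into a hash-addressed cache and write a small pointer file."""
    # Reads the whole file into memory; fine for a sketch, not for huge files.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(
        json.dumps({"path": data_file.name, "sha256": digest}, indent=2) + "\n"
    )
    return pointer


def restore_from_cache(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    info = json.loads(pointer.read_text())
    target = pointer.with_name(info["path"])
    shutil.copyfile(cache_dir / info["sha256"], target)
    return target
```

Because the pointer file is tiny and plain text, it versions well in Git, while the cache can live on shared or cloud storage.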

Versioning with Neptune

Neptune is a metadata store for MLOps, developed for research and production teams. It gives you a central hub to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. Researchers and engineers use Neptune for experiment tracking and model registry to control their experimentation and model development.

With the track_files() method, you can log metadata about datasets, models, and any other artifacts that can be stored as files.

For example:

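import neptune

# Assumes the neptune 1.x client, with NEPTUNE_PROJECT and
# NEPTUNE_API_TOKEN set in the environment
run = neptune.init_run()

# Single file
run["train/dataset"].track_files("./datasets/train.csv")

# Folder
run["train/images"].track_files("./datasets/images")

run.stop()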

You can later view the tracked artifacts in the Neptune app.

You can version datasets or models stored on Amazon S3 or compatible storage (s3://...), such as MinIO or Google Cloud Storage (GCS).

Version Control best practices

Use a good commit message

A good commit message will help other developers and team members understand what you updated. Adding a short and informative description will make things much easier to review.
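A team can automate part of this with a simple check on the first line of each commit message. The Python sketch below is one possible heuristic: the 72-character limit and the list of vague subjects are assumptions, not rules from any VCS.

```python
# Hypothetical examples of subjects that say nothing about the change.
VAGUE_SUBJECTS = {"update", "updates", "fix", "fixes", "changes", "wip", "misc"}


def check_commit_message(message: str, max_subject_length: int = 72) -> list[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    problems = []
    if not subject:
        problems.append("subject line is empty")
    elif subject.lower().rstrip(".") in VAGUE_SUBJECTS:
        problems.append("subject line does not say what changed")
    if len(subject) > max_subject_length:
        problems.append(f"subject line is longer than {max_subject_length} characters")
    return problems
```

A message such as "Fix off-by-one error in train/test split" passes, while a bare "update" is flagged.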

Make each commit matter

Make sure that each commit serves a purpose, such as adding a new feature or fixing an issue, and isn’t just a backup of your work. Try not to make multiple changes in a single commit, as this will be difficult to review.

In this process, you may find some bugs and want to fix them too. If you want to commit only specific files, here are the commands for different VCS:

  • Git: git commit filename1 filename2 commits both named files.
  • SVN: svn commit filename1 filename2 commits both named files. Without filenames, svn commit commits all the updated files in the current directory.
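The Git variant can be scripted from Python by calling the git command line. The sketch below assumes git is installed and that the named files are already tracked in the repository; the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def commit_only(repo: Path, message: str, *filenames: str) -> list[str]:
    """Commit just the named tracked files and list the files in that commit."""
    # Passing paths to git commit limits the commit to those files.
    git(repo, "commit", "-m", message, "--", *filenames)
    changed = git(repo, "show", "--name-only", "--pretty=format:", "HEAD")
    return sorted(line for line in changed.splitlines() if line)
```

Other modified files stay uncommitted in the working copy, ready for a separate commit with its own message.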

If a single file contains multiple changes (for example, several features), introduce them in logical chunks and commit as you go. Each VCS supports this operation:

  • Git: Copy filename to a safe temporary location, then run git checkout filename to restore filename to its unchanged state.
  • SVN: Move filename to a safe temporary location, then run svn update filename to restore filename to its unchanged state.

Then re-apply the changes from the saved copy one logical chunk at a time, committing after each.

Don’t break builds or do force commits 

Avoid breaking builds by making complete commits. Prepare test cases for your commits and new APIs, so that other team members can use those files without breaking the build. Say you’re working on an API: it may work well on your machine but still break on a different one. Also, if the VCS refuses your update or displays an error message, it’s best not to force the commit. This might break the whole workflow and mess things up.

Review and branching

Whenever you commit and push your code to a central repository, make sure you do a good review. This will help you understand the commit, and find issues or gaps in the commit. 

Branching helps improve the workflow in many cases. It has some cons, but if you use good branching practices, you can make things easy for managing code, releases, milestones, issues, new features, and more. 

Make traceable commits 

For security and monitoring purposes, you need to store information about each commit, such as reviewer remarks, author details, and more. This ensures that any commit can be backed out if needed, and that changes can be re-applied or updated after analyzing the issue.

Conclusion

Version Control Systems are the best way to monitor and track changes of all code files, datasets, dependencies, documentation, and other key assets of your project. 

With a VCS, every team member is on the same page, collaboration is smoother, workflows are better, there are fewer errors, and development gradually speeds up.

Additional research and recommended reading