Version Control Guide For Machine Learning Researchers

Version control is used to track and manage changes in your project, including key parameters, documentation, application code, and a lot more. Version Control Systems (VCS) come in handy when multiple people work on the same code. Without a VCS, a software project would be chaos.

Version control is also called source control or revision control; the terms are interchangeable.

What is Version Control?

Version Control is a way of tracking changes made to a codebase. In ML, the Version Control System enables smooth collaboration between developers and data scientists and allows you to restore your codebase to any previous version in case of problems.

A VCS helps teams collaborate on the same project without too many complications. Developers and data scientists can view, manage, and merge changes wherever needed, and new team members can adapt to the current environment more easily.

When working on machine learning models, it's good practice to track the training datasets, hyperparameters, algorithm choices, architecture, and pipelines used to create each model.

Version control tracking
Source: Author

Types of Version Control systems

We’ll discuss Centralized and Distributed systems and compare them in detail. You can also have a local version control system, with all commits saved on a local machine, but it’s highly error-prone, so we won’t get into it. 

1. Centralized Version Control

In centralized version control, you have a single repository where everyone commits and updates the code. When someone updates the code, everyone on the team is notified and has to update the code in their workspace. Every change goes to a central server (repository), so your change directly affects the master code, which will be in production. There's no approval or merging system; it's a simple commit-and-update workflow.

Centralized version control
Source: Author
  • Centralized systems are easy to understand and manage.
  • Everyone works against the master code directly, with no approval workflow slowing things down.
  • They perform well with binary files.

The most widely used centralized VCSs are SVN and Perforce.

2. Distributed Version Control

In distributed version control, every developer gets their own repository and copy of the program. When you update the code, only you have access to it until you push it to the central repository or master code. In simple terms, you commit and push your changes; other team members pull and update. You don't have to rely on the central repository all the time; you can clone the entire repository to your local machine and work offline. Distributed workflows usually still designate a central repository, but it's authoritative only by convention, not by technical necessity.

Distributed version control
Source: Author
  • Better performance and governance.
  • Branching and merging are easier than in centralized version control.
  • You can work offline.

The most popular distributed version control systems are Git and Mercurial.

Centralized and distributed Version Control differences

  • Ease of use: centralized VCS is easy to use as a beginner; distributed VCS is more complicated for beginners.
  • Offline work: with centralized VCS you can't work offline; with distributed VCS you can work offline on your local machine.
  • Speed: centralized VCS is slower and more time-consuming because it requires direct server communication; distributed VCS is faster because you don't have to communicate with the repository for every command.
  • Disk space: with centralized VCS you don't need to save the entire history or source code to your local machine, which saves a good amount of space; with distributed VCS, a project with large files takes more time and space on your local machine.
  • Server failure: with centralized VCS, if the server goes down there's no way to get the code back; with distributed VCS, if the server goes down you have plenty of machines to get the code from, including your local one.

Why use Version Control

A VCS lets you merge, change, and view all previous changes made to your program. It's a good way to monitor builds in development and production. For ML and AI work, it's important to use a VCS for non-binary files so that data scientists and ML developers can work from a central hub without affecting each other's work. Versioning is an important part of governance in the machine learning and AI world.

Collaboration

VCS makes it easy for developers to share, edit, view, and merge code from different places at different times. Without VCS, things get very complex and code can get buggy quickly. VCS becomes a common hub or central place where everyone on board can monitor what’s going on with the program. 

Storing versions

Say you made the necessary changes to the program, and now you have to merge them into the production code. It's time to store those changes properly, saving files with clear version names (e.g. profile-redesign-v102). If you use long and complicated version names, you'll get lost later. It's best to keep names simple and alphanumeric, with some semantic information, so versions can be tracked easily whenever needed. Documentation of updates is very important in the software world; without proper documentation or README files, developers lose track of progress. A VCS keeps records of all changes made from day one: you can request any version you want and compare it with the current one.
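As a toy illustration of such simple, semantic, alphanumeric names, a small helper (the naming scheme and function are hypothetical, not part of any VCS) can bump the trailing number of a version string:

```python
import re

def bump_version(name: str) -> str:
    """Increment the trailing number of a simple version name,
    e.g. 'profile-redesign-v102' -> 'profile-redesign-v103'."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no trailing version number in {name!r}")
    return name[: match.start(1)] + str(int(match.group(1)) + 1)

print(bump_version("profile-redesign-v102"))  # profile-redesign-v103
```

Keeping the semantic prefix intact while only the number changes makes old versions easy to find and compare.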

Review & backup 

Every time you push or merge code, you add a short note on what changed. This helps everyone working on the program see which file changed and why. A version control system is also the best way to back up your files and code: you can retrieve the complete, updated code whenever you want without any issues.

Faster improvements

While pushing a new version of a machine learning model, you may discover that it has an issue or fails at a particular point. Having a full record of your code lets you restore the previous version and gives you enough time to work on the issue while the production code stays bug-free.

Complexity 

When you work with a standard VCS, you only keep track of code files, documentation, and file dependencies. Machine learning is more complex: you also have to track datasets, models, and parameters, and keep an eye on which data was used for training and which for testing. Since ML projects often use multiple languages and framework tools, tracking dependencies becomes a critical task. Developers take time to roll models out to production until they're confident in the performance; a VCS helps them roll out versions at the right time.

Governance

Versioning is an important part of machine learning and AI governance. It helps companies understand workflows, grant access rights, execute strategies, and track and control all machine learning work and its results. Properly managed, versioning lets the organization collect any information that may affect the performance of a model.

Data

With time, the data pipeline changes. You have to keep training your model, and this is where versioning comes into play. You have to keep track of all metadata and results related to your ML workflow. Metadata helps you understand the nature of specific values in your dataset, and because the model must receive data in the proper format, the metadata has to be properly versioned too.

With VCS, you can connect data values to the respective metadata of a specific version. This helps developers find gaps in the dataset and values.
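One simple way to tie a dataset to a version identifier is content hashing: identical bytes always map to the same fingerprint, so the hash can be stored alongside a run's metadata. A minimal sketch (the file name and helper are illustrative, not a Neptune or DVC API):

```python
import hashlib

def data_version(path: str, chunk_size: int = 1 << 20) -> str:
    """Fingerprint a data file by hashing its bytes in chunks;
    identical contents always yield the same identifier."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]

# Create a tiny example dataset and fingerprint it.
with open("my_data.csv", "w") as f:
    f.write("id,label\n1,cat\n2,dog\n")

print(data_version("my_data.csv"))
```

If the dataset changes by even one byte, the fingerprint changes, which is exactly what makes it useful for spotting gaps or silent drift between versions.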

Model

All ML project leaders want to improve the accuracy of their ML model(s). If it turns out that a previous version was more accurate, VCS lets you go back to that version quickly.
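For instance, if each version's validation accuracy has been logged, finding the version worth restoring is trivial (the version names and scores below are invented for illustration):

```python
# Hypothetical log of validation accuracy per model version.
history = {
    "model-v1": 0.84,
    "model-v2": 0.91,
    "model-v3": 0.88,  # the latest version regressed
}

# The version to roll back to is simply the best-scoring one.
best_version = max(history, key=history.get)
print(best_version)  # model-v2
```

With a VCS (or a model registry) holding every version, reverting to `model-v2` is then a single checkout rather than a retraining effort.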

Continuous Integration 

Continuous Integration (CI) is the practice of merging feature branches into the master code frequently, with each merge built and tested automatically. This process helps developers locate issues and quickly find solutions. Fail fast, improve faster: that is what CI helps you do, and it improves the quality of your workflow.
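In ML projects, the automated tests often include a quality gate on model metrics, not just unit tests. A minimal sketch of such a gate (metric names and thresholds are invented for illustration):

```python
def quality_gate(metrics: dict, thresholds: dict) -> bool:
    """Pass only if every tracked metric meets its minimum;
    a CI job would fail the build when this returns False."""
    return all(metrics.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())

metrics = {"accuracy": 0.92, "f1": 0.88}
thresholds = {"accuracy": 0.90, "f1": 0.85}
print(quality_gate(metrics, thresholds))  # True
```

Failing the build as soon as a metric drops below its threshold is the "fail fast" part: the regression is caught at merge time, not in production.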

See – Best 7 Data Version Control Tools That Improve Your Workflow with Machine Learning Projects

How to do Version Control 

Git

Git is the most popular VCS; most developers around the world use it. It tracks all the changes made to code files so you have a record, and you can easily return to a specific version if something goes wrong with the current code. Git also has an easy-to-manage collaboration system.

Version control - collaboration
Source: Author

Benefits of Git:

  • Performance – Git is a reliable, easy-to-maintain version control system. You can view and compare changes made to the code over time, and merge them.
  • Flexibility – Git supports many kinds of development methods and workflows, and it's designed to handle both small and large projects. It supports branching and tagging to record every change made to the project.
  • Wide acceptance – Git is universally accepted for its usability and performance standards.

Sandbox

A sandbox is an environment where teams can build, test, and deploy programs. In a sandbox environment, code is split into smaller groups or cells; Jupyter Notebook is a popular tool for ML sandboxing, and sometimes a sandbox environment is the best way to test out IT solutions. A sandbox helps with project integration, user demos and testing, and quality assurance, and it has features for every stage of your machine learning lifecycle, not just the testing phase.

Benefits of Sandbox Environment:

  • Easy-to-use workflow – develop, test, and deploy versions seamlessly. You can edit a few lines of code and test them without pushing whole files or projects.
  • Collaboration – sharing and keeping everyone updated about the program is critical. You can get feedback and updates on the program with few permissions, and work with established technologies without architectural complications.
  • Cost-effective – sandbox environments are pay-as-you-go services, so maintaining your workflow becomes easy and saves your company money.

Data Version Control

Data Version Control (DVC) is an open-source tool for machine learning and data science projects, and a strong choice for versioning an AI project. It's similar to Git, but it can also track pipeline steps, dependencies, data files, and a lot more, and it helps you store large datasets, models, and files.

Benefits of DVC:

  • Language-independent – ML processes become reproducible pipelines regardless of which language they're written in.
  • Easy Installation – DVC is an open-source tool you can easily install with simple commands:
pip install dvc
  • Data Sharing – you can share your ML data files on any cloud platform like Amazon Web Services or Google Cloud Platform. 
  • Data compliance – monitor data modifications through Git pull requests. Inspect the project's immutable history to learn when datasets or models were approved, and why.

Versioning with Neptune

Neptune is a metadata store for MLOps, developed for research and production teams. It gives you a central hub to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. Researchers and engineers use Neptune for experiment tracking and as a model registry to control their experimentation and model development.


Let’s start by initializing Neptune:

import neptune.new as neptune
run = neptune.init('USER_NAME/PROJECT_NAME',
                   api_token='ANONYMOUS')

File Data Version:

from neptunecontrib.versioning.data import log_data_version
FILEPATH = '/path/to/data/my_data.csv'
with neptune.create_experiment():
    log_data_version(FILEPATH)

Folder Data Version:

from neptunecontrib.versioning.data import log_data_version
DIRPATH = '/path/to/data/folder'
with neptune.create_experiment():
    log_data_version(DIRPATH)

S3 bucket data version:
(You can log the version and specific key, similar to file versioning)

from neptunecontrib.versioning.data import log_s3_data_version
BUCKET = 'my-bucket'
PATH = 'training_dataset.csv'
with neptune.create_experiment():
    log_s3_data_version(BUCKET, PATH)

Prefixing:
(You can track multiple data sources, just make sure you use prefix before logging)

from neptunecontrib.versioning.data import log_data_version
FILEPATH_TABLE_1 = '/path/to/data/my_table_1.csv'
FILEPATH_TABLE_2 = '/path/to/data/my_table_2.csv'
with neptune.create_experiment():
    log_data_version(FILEPATH_TABLE_1, prefix='table_1_')
    log_data_version(FILEPATH_TABLE_2, prefix='table_2_')

Logging Image directory snapshots with subfolders:
(With log_image_dir_snapshots you can log visual snapshots of your image directories; check this experiment for detailed information)

from neptunecontrib.versioning.data import log_image_dir_snapshots
PATH = '/path/to/data/my_image_dir'
with neptune.create_experiment():
    log_image_dir_snapshots(PATH)

Version Control best practices

Use a good commit message

A good commit message will help other developers and team members understand what you updated. Adding a short and informative description will make things much easier to review.

Make each commit matter

Make sure that your commit serves a purpose and isn't just a backup of your work: add a new feature or fix an issue. Try not to make multiple unrelated changes in a single commit, as that makes review difficult.

In this process, you may find some bugs and want to fix them too. If you want to commit one file at a time, here are the commands for different VCSs:

  • Git: git commit filename1 filename2 commits both named files.
  • SVN: svn commit filename1 filename2 commits both files; running svn commit with no arguments commits all updated files in the working directory.

If you want to save a file that serves multiple changes (for example multiple features), then you have to introduce them in logical chunks and make commits as you go. Each VCS supports this operation:

  • Git: store filename in a safe temporary location, then run git checkout filename to restore filename to its unchanged state.
  • SVN: move filename to a temporary place, then run svn update filename to restore filename to its unchanged state.

Don’t break builds or do force commits 

Try to avoid breaking builds by performing complete commits. You can prepare test cases for your commits and new APIs, which helps other team members use those files without breaking the build. Say you're working on an API: it works well on your machine, but it might still break on a different one. And if the VCS isn't accepting your push or is displaying an error message, it's best not to force the commit; forcing it might break the whole workflow and mess things up.

Review and branching

Whenever you commit and push your code to a central repository, make sure you do a good review. This will help you understand the commit, and find issues or gaps in the commit. 

Branching helps improve the workflow in many cases. It has some cons, but if you use good branching practices, you can make things easy for managing code, releases, milestones, issues, new features, and more. 

Make traceable commits 

For security and monitoring purposes, you need to store information about each commit: reviewer remarks, author details, commit metadata, and more. This ensures that, if needed, any commit can be backed out, and changes can be re-applied or updated after analyzing the issue.
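The kind of record meant here can be sketched as a simple data structure (the fields and values are illustrative; real systems store much richer metadata):

```python
from dataclasses import dataclass, field

@dataclass
class CommitRecord:
    """Illustrative traceability record for a single commit."""
    commit_id: str
    author: str
    message: str
    reviewer_remarks: list = field(default_factory=list)

record = CommitRecord(
    commit_id="a1b2c3d",
    author="jane@example.com",
    message="Retrain model on cleaned dataset",
    reviewer_remarks=["Metrics verified against the v1 baseline"],
)
print(record.commit_id)  # a1b2c3d
```

Keeping records like this queryable is what makes it possible to identify, audit, and back out any individual commit later.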

Conclusion

Version Control Systems are the best way to monitor and track changes of all code files, datasets, dependencies, documentation, and other key assets of your project. 

With a VCS, every team member is on the same page, collaboration is smoother, workflows are better, there are fewer errors, and development gradually speeds up.
