As a Data Scientist or Machine Learning engineer, you have probably experienced this:
You start writing code in your Jupyter Notebook, do your exploration, pre-process your data, build your models and save them… but then you lose track of which code and notebook version was the best.
There are conventional tools, like git and other source code and notebook versioning systems, but they’re not particularly suited for improving the productivity of ML teams during the development and experimentation stage.
That’s why, in this article, I’ll list tools that give you and your team a more organized platform for owning and working on different versions of your experimentation notebooks, making the whole team more productive.
By the end of this article, you should have:
- An understanding of the need to version your notebooks.
- Knowledge of the factors to consider when choosing your tool for versioning notebooks.
- Awareness of the top 5 tools for versioning your notebooks and code.
Key factors to consider when choosing a notebook versioning tool
When choosing a notebook versioning tool or platform, the following are the most important points to consider:
- Notebook Checkpointing/Commits: You should be able to revert to previous checkpoints in the notebook you versioned, and compare changes across the different checkpoints.
- Compare Different Notebooks: You should be able to compare differences and changes made between different notebooks.
- Team Collaboration: You should be able to efficiently work with teammates on various Notebook versions through sharing, in-notebook comments and chat features (so unnecessary meetings can be avoided), results descriptions from the notebook, direct links between notebook versions and experiments carried out by the team, and other collaborative features.
- Pricing Options: Is the tool/platform free or paid? What are the options and different plans for the paid tools? How do they stack up against the open-source tools?
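To make the first two criteria concrete, here is a minimal sketch of what "comparing changes across checkpoints" means at the file level. It relies only on the fact that an .ipynb file is a JSON document with a list of cells; `make_notebook` is a hypothetical helper for the demo, and none of the tools below necessarily work this way.

```python
import difflib
import json


def code_cells(notebook_json: str) -> list[str]:
    """Return the source of every code cell in an .ipynb document."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # The notebook format allows source as a string or a list of strings.
            sources.append(source if isinstance(source, str) else "".join(source))
    return sources


def diff_checkpoints(old_json: str, new_json: str) -> list[str]:
    """Line-level unified diff of the code in two notebook versions."""
    old_lines = "\n".join(code_cells(old_json)).splitlines()
    new_lines = "\n".join(code_cells(new_json)).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="checkpoint", tofile="current", lineterm=""
        )
    )


def make_notebook(*cell_sources: str) -> str:
    """Hypothetical helper: build a tiny notebook document for the demo."""
    cells = [
        {"cell_type": "code", "source": source, "metadata": {},
         "outputs": [], "execution_count": None}
        for source in cell_sources
    ]
    return json.dumps({"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 5})


old = make_notebook("import pandas as pd", "df = df.dropna()")
new = make_notebook("import pandas as pd", "df = df.fillna(0)")
for line in diff_checkpoints(old, new):
    print(line)
```

A dedicated tool adds what this sketch leaves out: rendered outputs, a history of checkpoints, and a way to share the comparison with teammates.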
Top 5 tools for versioning your notebooks
Here are the top 5 tools for versioning your Notebooks to effectively manage experiments.
Notebook Checkpointing/Committing
- Neptune.ai: Provides support for storing notebook checkpoints.
- ReviewNB: Integrates well with GitHub but does not store checkpoint files.
- Amazon SageMaker Studio: Stores notebook checkpoints and tracks new commits.
- nbdime: Tracks each new checkpoint and commit.
- Comet: Does not store or track checkpoint files or commits.
Notebook Comparison Feature
- Neptune.ai: Provides support to compare different notebooks and notebook versions.
- ReviewNB: Tracks changes only in one notebook but does not provide cross-notebook version comparison.
- Amazon SageMaker Studio: Provides notebook diffs support for notebooks but not cross-notebook version comparison.
- nbdime: Can compare different notebooks and versions, and even merge all changes into one notebook.
- Comet: Does not provide a notebook comparison feature.
Team Collaboration
- Neptune.ai: Easily share and track notebook data, from notebook author to description, time of last checkpoint saved, and associated experiments.
- ReviewNB: Great for notebook code review and seamless team collaboration.
- Amazon SageMaker Studio: Limited notebook link sharing options (only AWS authorized users); cannot comment within the notebook for team communication.
- nbdime: Does not support link sharing but integrates well with existing Git workflows.
- Comet: Team collaboration is good for a Machine Learning project management SaaS.
Pricing Options
- Neptune.ai: Free for individual users, paid for teams.
- ReviewNB: Free for open source and educational purposes; starts at $79 for a team of up to 10 users.
- Amazon SageMaker Studio: Free for AWS users, but you pay for compute instances to run your notebook workloads.
- nbdime: Open source; completely free.
- Comet: Free limited features for individual users; starts at $179 for a team of up to 5 users.
Now, let’s review them in detail.
If you’re looking for a platform that can manage all your experiment metadata and your team’s development process, look no further than Neptune.ai. Neptune is now a metadata store for MLOps, built for research and production teams that run a lot of experiments. It’s also very useful for managing different notebook versions, and showing you the different notebook versions and who created them – all in one tab. The Neptune.ai client can help you manage your on-prem notebooks, too.
Notebook checkpointing/committing using Neptune.ai
Neptune.ai does not just manage various notebook versions efficiently, it also manages the checkpoint file for each notebook so you can track changes and differences across checkpoints.
You can also compare the differences and changes made across two checkpoints.
Compare different notebooks with the Neptune.ai platform
Beyond comparing checkpoints, you can also compare notebook versions from the same experiment or from different experiments. Together with the other features Neptune.ai offers, such as tracking logs, metrics, and other experiment metadata, this helps you and your team make informed decisions.
Team collaboration with Neptune.ai
Apart from keeping all your notebook versions in one place, along with their metadata (execution time, experiment metrics), Neptune.ai keeps your team organized in a git-like way by letting you add descriptions to each notebook version.
Neptune.ai provides a rich and visual dashboard that gives you and your team an idea of who worked on any particular phase of your ML experimentation notebook, the time they last worked on it, and their easy-to-read description of the notebook so you can see everything at a glance.
You can share your notebook version dashboard with anyone on your team. This can help your team stay productive and get the most relevant details from a notebook version quickly.
Neptune.ai can also be hosted on-premise, so your team can collaborate even if all your workloads have to run on-premise.
Pricing options for Neptune.ai
Now you may think that this efficient notebook versioning platform comes at a hefty cost, right? Luckily, it doesn’t. If you want to use the platform for your personal project, or for work or research, Neptune.ai gives you 100GB of storage for free to store your experiment notebook and metadata.
If you have an educational organization or a non-profit, Neptune offers its entire platform to your team for free. For commercial organizations, it is paid.
If you work at a large company and worry about the monthly cost, you can always contact Neptune.ai for a more customized plan. Check the pricing options here.
Getting started with Neptune.ai
ReviewNB is one of the most popular Jupyter Notebook tools in the Data Science community. It’s a GitHub-verified application that provides a complete code review workflow for notebooks. It integrates well with Git and GitHub, and is purpose-built for reviewing code in notebooks.
Because ReviewNB integrates well with GitHub, it shows rich diffs between checkpoints, commits, and even pull requests. You can see a rich notebook diff for any commit or pull request made by you or anyone else on your team.
Compare different notebooks using ReviewNB
Unfortunately, ReviewNB can’t compare changes between different notebooks; it only shows changes within one specific notebook. For example, you may want to compare a notebook version that uses one type of pre-processing against another that uses a different type. This is not possible in ReviewNB (Neptune supports it if you need this).
Team collaboration using ReviewNB
ReviewNB was built with team collaboration in mind. In fact, on their homepage, you will notice that the platform is streamlined for data science collaboration. With features like code reviews, ReviewNB provides data science teams with the ease of writing a comment on any notebook cell, asking clarifying questions, and providing feedback in the context of a notebook cell.
For each new comment or review opened, teams can get notified directly in their email. You can comment on the notebook diff to discuss changes, open a thread of discussion on each new review, and track conversation threads.
ReviewNB can also be hosted on-premise for your team to collaborate if you need to run all your workloads on-prem. To get set up, you can request the installation here.
Pricing options for ReviewNB
ReviewNB is free for open source use and for educational purposes (if you have a .edu email, you can also sign up for the free offering). It’s $79 per month for a small team of up to 10 users, and $249 per month for a business of up to 30 users. As with Neptune.ai, larger organizations that need unlimited features can contact the team.
Getting started with ReviewNB
As per the official website, Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10x.
SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. Obviously, it’s powered by AWS Cloud, so you can only access it through proper AWS authorization.
Notebook checkpointing/committing using AWS SageMaker Studio
Just like your regular Jupyter Notebooks, SageMaker Studio lets you save notebook checkpoints manually, so you can revert to them as needed. The checkpoint files are stored in an S3 bucket, created automatically for you when you create a new environment. You can use the checkpoint diff feature to view changes between your latest notebook version and the checkpoint file. All the notebook metadata is also managed by Amazon SageMaker.
If a notebook is opened from a Git repository, you can view the difference between the notebook and the last Git commit.
Compare different notebooks using AWS SageMaker Studio
As with ReviewNB, AWS SageMaker Studio can’t compare changes between different notebooks; you can only view changes within one specific notebook, using any of the notebook diff features in Studio.
Team collaboration using AWS SageMaker Studio
With AWS SageMaker, you can carry out all the ML development activities including notebooks, experiment management, automatic model creation, debugging, and model and data drift detection. You can share your Amazon SageMaker Studio notebooks with your teammates. What you share is a copy, so your teammates can’t modify your original notebook.
For your teammates to make changes, they have to create a new copy of your original notebook. If you want to share your latest version, you must create a new snapshot and then share it. You can also decide not to include the output of your cell runs.
It has a nice GitHub integration feature where you can include the link to the Git repository that contains the notebook, and then you and your colleagues can collaborate and contribute to the same Git repository. Your sharing option is also quite secure, because you can provide IAM authentication for specific colleagues you want to be able to access your shared Notebook.
With SageMaker Studio, you don’t get the luxury of leaving comments in a notebook shared by your colleague, as you do with Neptune.ai and ReviewNB. If you want to make changes, you’ll have to create a new copy of the notebook, which is tedious and inefficient.
Pricing options for AWS SageMaker Studio
Using AWS SageMaker Studio is free of charge, but you pay the standard SageMaker price for the compute instances you run your notebooks on.
Getting started with AWS SageMaker Studio
To access AWS SageMaker Studio, you need an AWS account with IAM-authorized access to SageMaker. You can watch this webinar by Julien Simon to understand other SageMaker Studio offerings. You may want to check the pricing to see if your free tier or organization billing supports any of the standard instances.
nbdime provides tools for diffing and merging of Jupyter Notebooks. It’s completely open source and works locally. It’s easy to install with very useful documentation and has an active community.
Notebook checkpointing/commits using nbdime
`nbdime` has good integration with `git` version control, and you can set it up to work with a local `git` client so that the `git diff` and `git merge` commands use `nbdime` for .ipynb files. With the git integration, you can work with Jupyter Notebooks locally and run `git diff` to see how a notebook has changed before committing it to your repository.
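To see why a notebook-aware diff matters here, consider the minimal sketch below. It is not how `nbdime` is implemented; it only illustrates the problem using the JSON structure of an .ipynb file. `make_run` and `strip_outputs` are hypothetical helpers written for this example.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Return the notebook with outputs and execution counts removed."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(notebook, indent=1, sort_keys=True)


def make_run(execution_count: int, printed: str) -> str:
    """Hypothetical helper: the same one-cell notebook as saved after a run."""
    cell = {
        "cell_type": "code",
        "source": "print(model.score(X, y))",
        "metadata": {},
        "execution_count": execution_count,
        "outputs": [{"output_type": "stream", "name": "stdout", "text": printed}],
    }
    return json.dumps({"cells": [cell], "metadata": {}, "nbformat": 4, "nbformat_minor": 5})


first_run = make_run(1, "0.91\n")
second_run = make_run(7, "0.93\n")

# The saved files differ even though nobody edited the code...
raw_files_differ = first_run != second_run
# ...and the difference disappears once run-specific fields are ignored.
code_differs = strip_outputs(first_run) != strip_outputs(second_run)
print(raw_files_differ, code_differs)
```

A plain text diff of the two saved files would report changes in every re-executed cell, which is the noise a notebook-aware diff is meant to separate from real code edits.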
Compare different notebooks using nbdime
With nbdime, you can compare the difference between two or more notebook versions in a terminal-friendly way, and even merge multiple notebooks with conflicts automatically resolved. You also get a rich visual experience when comparing differences between notebooks, charts, or other data.
Team collaboration using nbdime
With its impressive integration with `git` and the `nbmerge` three-way merge of notebooks with automatic conflict resolution, `nbdime` is perhaps the best completely free version control tool for Jupyter Notebooks, especially for teams already familiar with git.
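The idea behind a three-way merge can be sketched in a few lines. The example below is a deliberately simplified illustration, assuming all three notebook versions have the same number of cells and aligning them by position; it is not the algorithm `nbdime` uses, and `merge_cells` is a hypothetical name.

```python
def merge_cells(
    base: list[str], ours: list[str], theirs: list[str]
) -> tuple[list[str], list[int]]:
    """Three-way merge of cell sources aligned by position.

    Returns the merged cells and the indices of conflicting cells, where
    both sides changed the same cell in different ways. Conflicting cells
    keep our version so the caller can review them.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equal numbers of cells")

    merged: list[str] = []
    conflicts: list[int] = []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:      # both sides agree (or neither changed the cell)
            merged.append(o)
        elif o == b:    # only their side changed the cell
            merged.append(t)
        elif t == b:    # only our side changed the cell
            merged.append(o)
        else:           # both sides changed it differently
            merged.append(o)
            conflicts.append(index)
    return merged, conflicts


base = ["import pandas as pd", "df = load()", "model = fit(df)"]
ours = ["import pandas as pd", "df = load().dropna()", "model = fit(df)"]
theirs = ["import pandas as pd", "df = load()", "model = fit(df, depth=5)"]

merged, conflicts = merge_cells(base, ours, theirs)
print(merged)
print(conflicts)
```

Because the two teammates edited different cells, both changes land in the merged notebook and no conflict is reported. A real tool also has to handle inserted, deleted, and reordered cells, as well as outputs and metadata.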
Pricing options for nbdime
`nbdime` is completely open source and free for teams of all sizes. The only disadvantage is that you can only use it locally, but if you and/or your team already have git incorporated into your workflow, it should not be a hassle to adopt.
Getting started with `nbdime`
`nbdime` is very easy to install. The best place to get started is the documentation.
Our final tool for versioning your code and notebooks is Comet. Comet lets data science teams track, compare, explain, and optimize their experiments and models, bringing efficiency, transparency, and reproducibility to machine learning. It is a meta machine learning platform available both self-hosted and in the cloud.
Notebook checkpointing/committing using Comet
In terms of the notebook versioning experience, Comet falls short of the other options listed above. Unlike Neptune.ai, it also does not save your checkpoint files.
I’d say Comet is actually the best tool on this list for versioning Python code and comparing changes across your various code experiments (which is great when you want to push your experiments to production). However, for notebooks where various teams work on different project phases, from data exploration to model building, it falls short of Neptune.ai in providing a rich Jupyter Notebook experience, which may make team collaboration and communication difficult.
Team collaboration using Comet
Comet is one of the best tools for managing machine learning projects in development and in production. You can share your versioned notebooks with your teammates, but the functionality and support for Jupyter Notebook is limited. You only have limited information on who authored what notebook, the time of execution, the various versions, and so on. However, Comet saves execution-ordered code from experiments run using Jupyter notebooks, so collaborators can still easily reproduce any experiment.
Comet automatically integrates with your project’s git repository, so if your team is already used to a `git` workflow, it will be easy to integrate Comet.
Pricing options for Comet
Comet is the most expensive of all the platforms listed above. While it is free for a single user, a team that wants extra functionality may pay between $179 and $249 per month.
You may want to consider your options carefully with this one. You can check the pricing page here for more information.
Getting started with Comet
Final thoughts and next steps
I hope this article has helped you understand the need for versioning your experiment notebooks and choose a platform that works for you, or for you and your team.
I encourage you to choose a platform to work with for the next few days, based on your needs, and see if it works for you and/or your team.