Managing machine learning projects is not exactly a piece of cake, but every data scientist already knows that.
It touches many things:
- Data exploration
- Data preparation and setting up machine learning pipelines
- Experimentation and model tuning
- Data, pipeline, and model versioning
- Managing infrastructure and run orchestration
- Model serving and productionalization
- Model maintenance (retraining and monitoring models in production)
- Effective collaboration between different roles (data scientist, data engineer, software developer, DevOps, team manager, business stakeholder)
So how do you make sure everything runs smoothly? How do you administer all the parts to create a coherent workflow?
By using the right tools to help you manage machine learning projects. But you already know that 😉 There are many apps that can help you improve parts of a workflow, or even an entire one. Sure, it’s really cool to do everything yourself, but why not use tools if they can save you lots of trouble?
Let’s get right to it and see what’s on the plate! Below is a list of tools that touch various points listed above. Some are more end-to-end, some are focused on a particular stage of the machine learning lifecycle, but all of them will help you manage your machine learning projects. Check out our list and choose the one(s) you like most.
Neptune is a metadata store for MLOps built for research and production teams that run a lot of experiments. It is very flexible, works with many other frameworks, and thanks to its stable user interface, it enables great scalability (to millions of runs).
It’s robust software that can store, retrieve, and analyze a large amount of data. Neptune has all the tools for efficient team collaboration and project supervision.
Neptune – summary:
- Provides user and organization management with different organizations, projects, and user roles
- Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
- You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
- Your team can track experiments that are executed in scripts (Python, R, or others) and notebooks (local, Google Colab, AWS SageMaker), and do that on any infrastructure (cloud, laptop, cluster)
- Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)
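To make the idea of experiment tracking more concrete, here is a minimal, tool-agnostic sketch in plain Python. It is not Neptune’s API: the `Run` class and the `best_run` helper are hypothetical names invented for this illustration, and the metric values are made up. It only shows the core idea of logging parameters and metrics per run and then comparing runs.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One tracked experiment: its parameters plus metric histories (hypothetical)."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def log_metric(self, key: str, value: float) -> None:
        # Append to the metric's history so the whole curve is kept, not just the last value.
        self.metrics.setdefault(key, []).append(value)

    def last(self, key: str) -> float:
        # Most recently logged value of a metric.
        return self.metrics[key][-1]


def best_run(runs, key, higher_is_better=True):
    """Return the run whose final value of `key` is best."""
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r.last(key))


# Two illustrative runs with made-up validation accuracies.
baseline = Run("baseline", params={"lr": 0.1})
tuned = Run("tuned", params={"lr": 0.01})
for acc in (0.71, 0.78, 0.80):
    baseline.log_metric("val_acc", acc)
for acc in (0.69, 0.81, 0.86):
    tuned.log_metric("val_acc", acc)

winner = best_run([baseline, tuned], "val_acc")
print(winner.name, winner.params)
```

A real tracking tool adds persistence, a UI, and team access on top of this basic log-and-compare loop.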
Kubeflow is the ML toolkit for Kubernetes. It helps in maintaining machine learning systems by packaging and managing Docker containers. It facilitates the scaling of machine learning models by making run orchestration and deployments of machine learning workflows easier.
It’s an open-source project that contains a curated set of compatible tools and frameworks specific to various ML tasks.
Kubeflow – summary:
- A user interface (UI) for managing and tracking experiments, jobs, and runs
- Notebooks for interacting with the system using the SDK
- Re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
- Kubeflow Pipelines is available as a core component of Kubeflow or as a standalone installation
DVC is an open-source version control system for machine learning projects. It’s a tool that lets you define your pipeline regardless of the language you use.
When you find a problem in a previous version of your ML model, DVC saves you time by leveraging code, data, and pipeline versioning to give you reproducibility. You can also train your model and share it with your teammates via DVC pipelines.
DVC can cope with versioning and organization of large amounts of data and store them in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionalities.
DVC – summary:
- Works with different types of storage: it’s storage-agnostic
- Full code and data provenance help to track the complete evolution of every ML model
- Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
- Tracking metrics
- A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
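The two ideas in this summary, connecting steps into a DAG and tying a result to the exact inputs that produced it, can be sketched in plain Python. This is an illustration only, not DVC’s implementation or file format: the stage names, the `run_order` and `fingerprint` helpers, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def run_order(dag):
    """Return an execution order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def fingerprint(data: bytes, params: dict) -> str:
    """Hash the input data together with the configuration.

    If the fingerprint is unchanged, the stage's inputs are unchanged,
    so a cached result can be reused instead of re-running the stage.
    """
    digest = hashlib.sha256(data)
    # sort_keys makes the hash independent of dictionary ordering.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


order = run_order(stages)
print(order)

before = fingerprint(b"raw,data\n1,2\n", {"lr": 0.01, "epochs": 5})
after = fingerprint(b"raw,data\n1,3\n", {"lr": 0.01, "epochs": 5})
print("inputs changed:", before != after)
```

Hashing data and configuration together is what makes a past run reproducible: the same fingerprint means the same inputs.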
Polyaxon is a platform for reproducing and managing the whole life cycle of machine learning projects as well as deep learning applications.
The tool can be deployed into any data center or cloud provider, or it can be hosted and managed by Polyaxon. It supports all the major deep learning frameworks, e.g., Torch, TensorFlow, MXNet.
Polyaxon – summary:
- Supports the entire lifecycle including run orchestration but can do way more than that
- Has an open-source version that you can use right away but also provides options for enterprise
- Very well documented platform, with technical reference docs, getting started guides, learning resources, guides, tutorials, changelogs, and more
- Lets you monitor, track, and analyze every single optimization experiment with the experiment insights dashboard
GitHub is the most popular platform built for developers. It’s used by millions of teams around the globe as it allows for easy and painless collaboration. With GitHub, you can host and review code, manage projects, and build software.
It’s a great platform for teams collaborating on machine learning projects who want to simplify workflow and share ideas conveniently. GitHub lets teams manage ideas, coordinate work, and stay aligned with the entire team to seamlessly collaborate on machine learning projects.
GitHub – summary:
- Build, test, deploy, and run CI/CD the way you want in the same place you manage code
- Use Actions to automatically publish new package versions to GitHub Packages. Install packages and images hosted on GitHub Packages or your preferred registry of record in your CI/CD workflows
- The software lets you secure your work with vulnerability alerts so you can remediate risks and learn how CVEs affect you
- The built-in review tools make it easy and convenient to review code – a team can propose changes, compare versions, and give feedback
- GitHub easily integrates with other tools for smooth work, or you can create your own tools with the GitHub GraphQL API
- GitHub is a platform where all the documentation is easily accessible, and all the features make it a unified system for flexibly developing software.
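As a rough sketch of what building your own tool on the GitHub GraphQL API can look like, the snippet below prepares a request that lists a few open issues of a repository. It only builds the request and does not send it. The helper name `build_issues_request`, the token value, and the `octocat/Hello-World` repository are placeholders chosen for this example; check the official GraphQL schema before relying on the exact query fields.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Query asking for the titles and URLs of up to five open issues.
ISSUES_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(first: 5, states: OPEN) {
      nodes { title url }
    }
  }
}
"""


def build_issues_request(token: str, owner: str, name: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the issues query."""
    payload = {"query": ISSUES_QUERY, "variables": {"owner": owner, "name": name}}
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder token and repository, for illustration only.
request = build_issues_request("YOUR_TOKEN", "octocat", "Hello-World")
print(request.get_method(), request.full_url)

# To actually call the API, you would send the request, for example:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```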
Jira is great software for agile teams as it allows for fully-encompassed project management. It’s an issue and project tracking tool, so teams can plan, track, and release their product or software as a perfectly developed ‘organism’. With Confluence, teams have even more flexibility to manage ML projects.
The two tools allow for flexible workflow automation. You can freely manage a project by assigning tasks to people and bugs to programmers, creating milestones, or planning to carry out certain tasks within a specific timeframe.
Products and apps built on top of Jira combined with Confluence help teams plan, assign, track, report, and manage work. All updates from Jira will automatically appear in Confluence since the two tools are linked together.
Notion is a collaboration tool that lets you write, plan, and organize teamwork.
It has four modules, each with different functionalities:
- Notes, Docs – text editor which serves as a space for files, notes of different formats; you can add images, bookmarks, videos, code, and many more
- Knowledge Base – in this module, teams can store knowledge about projects, tools, best practices, and other aspects that are necessary for developing machine learning projects
- Tasks, Projects – tasks and projects can be organized in a Kanban board, calendar, and list views
- Databases – this module can effectively replace spreadsheets and keep records of important data and unique workflows in a convenient way
Additionally, every team member can use Notion for personal use to keep a record of work-related activities and information, for example, weekly agenda, goal, task list, or personal notes.
Other smaller features include Markdown support, /slash commands, drag-and-drop, comments and discussions, and integrations with 50+ popular apps such as Google Docs, GitHub Gist, CodePen, and more.
All modules create a coherent system that serves as a unified hub for work management and project planning.
Weights & Biases, a.k.a. WandB, is focused on deep learning. Users log experiments to the application with a Python library and, as a team, can see each other’s experiments.
WandB is a hosted service that lets you back up all experiments in a single place and work on a project with your team. WandB lets you log many data types and analyze them in a nice UI.
Weights & Biases – summary:
- Experiments tracking: extensive logging options
- Multiple features for sharing work in a team
- Several open source integrations with other tools available
- SaaS/Local instance available
- WandB logs the model graph, so you can inspect it later
Streamlit is an open-source Python library that enables you to build fancy custom web apps for machine learning and data science. It is perfect when you need to build a quick proof-of-concept app and show it to someone, especially when that someone is a bit less technical.
In Streamlit you can automatically update your app every time you change its source code. This allows you to work in a fast interactive loop:
- You type code, save it, try it out live
- Then type some more code, save it, try it out again
- And so on.
Streamlit’s architecture allows you to write apps the same way you write plain Python scripts.
You can easily share your machine learning models with other people and effectively work in a team.
See how we built a Streamlit app for exploring results of image segmentation and object detection models trained on COCO: How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)
10. Amazon SageMaker
Amazon SageMaker is a platform that enables data scientists to build, train, and deploy machine learning models. It has all the integrated tools for the entire machine learning workflow providing all of the components used for machine learning in a single toolset.
SageMaker is a tool suitable for organizing, training, deploying, and managing machine learning models. It has a single, web-based visual interface to perform all ML development steps – notebooks, experiment management, automatic model creation, debugging, and model drift detection.
Amazon SageMaker – summary:
- Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance – it helps to deploy the best performing model
- SageMaker Ground Truth helps you build and manage highly accurate training datasets quickly
- SageMaker Experiments helps to organize and track iterations of machine learning models by automatically capturing the input parameters, configurations, and results, and storing them as ‘experiments’
- SageMaker Debugger automatically captures real-time metrics during training (such as training and validation metrics, confusion matrices, and learning gradients) to help improve model accuracy. The Debugger can also generate warnings and remediation advice when common training problems are detected
- SageMaker Model Monitor allows developers to detect and troubleshoot concept drift. It automatically detects concept drift in deployed models and gives detailed alerts that help identify the source of the problem
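To give a feel for what drift detection means in practice, here is a deliberately simple sketch in plain Python. It is not how SageMaker Model Monitor works internally: the `drift_score` and `has_drifted` helpers, the threshold of 2.0, and the sample values are all assumptions for illustration. It compares the mean of a single numeric feature in live traffic against a training-time baseline, whereas production monitors rely on richer statistics.

```python
from statistics import mean, stdev


def drift_score(baseline, live):
    """Shift of the live mean, measured in baseline standard deviations."""
    return abs(mean(live) - mean(baseline)) / stdev(baseline)


def has_drifted(baseline, live, threshold=2.0):
    """Flag drift when the live mean moved more than `threshold` deviations."""
    return drift_score(baseline, live) > threshold


# Made-up values of one feature at training time and in two production windows.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_window = [10.2, 9.8, 10.1, 10.4]
shifted_window = [14.0, 15.0, 13.5, 14.5]

print("stable window drifted:", has_drifted(baseline, stable_window))
print("shifted window drifted:", has_drifted(baseline, shifted_window))
```

A check like this would typically run on a schedule, and an alert would trigger a closer look or a retraining job.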
11. Domino Data Lab
Domino Data Lab is a great tool to manage machine learning projects for teams who need a centralized hub to store all their data.
Domino is a data science platform that enables fast, reproducible, and collaborative work on data products like models, dashboards, and data pipelines. You can run regular jobs, launch interactive notebook sessions, view vital metrics, share work with teammates, and communicate with them directly in the Domino web app.
It’s an advanced management platform for all kinds of machine learning projects, especially helpful for growing organizations that need to share work and review code quickly and effectively.
Cortex is an open-source alternative to serving models with SageMaker or building your own model deployment platform on top of AWS services like Elastic Kubernetes Service (EKS), Lambda, or Fargate and open source projects like Docker, Kubernetes, TensorFlow Serving, and TorchServe.
It’s a multi-framework tool that lets you deploy all types of models.
Cortex – summary:
- Automatically scale APIs to handle production workloads
- Run inference on any AWS instance type
- Deploy multiple models in a single API and update deployed APIs without downtime
- Monitor API performance and prediction results
There are many great tools to choose from. Make sure to look for integrations and features that suit your needs to get the most out of your work.
Enjoy managing your machine learning projects!
How to Improve the Collaboration in the ML/DS Team?
As a run tracking hub, Neptune provides several features for enabling knowledge sharing and collaboration among members of your data science team. With it, you can:
- have every piece of every run or notebook of every teammate in one place,
- see and compare all the team’s experiments and models,
- see what everyone on the team is working on,
- share a view on a project or any of its parts by simply copying and pasting the URL to it,
- collaborate with other team members on the results.
- Seeing all model training metadata in one place
- Comparing model training runs
- Seeing model training runs live
- Being able to reproduce model training runs
- Have a central registry for the models, runs, and notebooks,
- Check how the model was built,
- Find and fetch the information they need for putting a model into production.