The Best Pachyderm Alternatives

Posted July 2, 2020

Pachyderm is a data science platform that helps you manage the end-to-end machine learning life cycle. It comes in three editions: Community Edition (open source, deployable anywhere), Enterprise Edition (complete version-controlled platform), and Hub Edition (a hosted version, still in beta).

Here’s what you can do with Pachyderm in a nutshell:

  • Pachyderm lets you continuously update data in the master branch of your repo, while experimenting with specific data commits in a separate branch or branches
  • It supports any type, size, and number of files including binary and plain text files
  • You can store the history of your commits in a centralized location so you don’t run into merge conflicts when you try to merge your .git history with the master copy of the repo
  • Commits are centralized and transactional, so branches are not used as extensively as in source code version-control systems
  • Provenance enables teams to build on each other’s work, and to share, transform, and update datasets while automatically maintaining a complete audit trail so that all results are reproducible

It surely is a powerhouse for ML experiments, but if you’re looking for a solution focused on a specific part of the machine learning lifecycle, or need a more lightweight tool, Pachyderm may not be the perfect tool for you.

That’s where we come in: here are the best Pachyderm alternatives. Take a look and choose your favorite!

Data and Pipeline versioning

1. DVC

DVC, or Data Version Control, is an open-source version control system for machine learning projects. It’s an experimentation tool that helps you define your pipeline regardless of the language you use.

When you find a problem in a previous version of your ML model, DVC helps you save time through code versioning, data versioning, and reproducibility. You can also train your model and share it with your teammates via DVC pipelines.

DVC can handle versioning and organizing large amounts of data, and stores it in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionality.

DVC – summary:

  • Works with different types of storage: it’s storage-agnostic
  • Full code and data provenance help to track the complete evolution of every ML model
  • Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
  • Tracking metrics
  • A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
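
To make the last point more concrete, here is a minimal sketch of what "connecting ML steps into a DAG" means. It is plain Python using the standard library, not DVC itself, and the stage names and the `run_order` and `run_pipeline` helpers are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each stage maps to the set of stages it depends on.
PIPELINE = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def run_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline, actions):
    """Call each stage's action in dependency order; return the executed names."""
    executed = []
    for stage in run_order(pipeline):
        actions[stage]()
        executed.append(stage)
    return executed


if __name__ == "__main__":
    # Placeholder actions that only print; real stages would run scripts.
    actions = {name: (lambda n=name: print(f"running {n}")) for name in PIPELINE}
    run_pipeline(PIPELINE, actions)
```

A pipeline tool adds a lot on top of this idea, such as skipping stages whose inputs have not changed, but the dependency ordering is the core of it.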

2. Kubeflow

Kubeflow is the ML toolkit for Kubernetes. It helps in maintaining machine learning systems – managing all the applications, platforms, and resource considerations. It makes machine learning models easier to scale by simplifying run orchestration and the deployment of machine learning workflows.

It’s an open-source project that contains a curated set of compatible tools and frameworks specific to various ML tasks.

Kubeflow – summary:

  • A user interface (UI) for managing and tracking experiments, jobs, and runs
  • Notebooks for interacting with the system using the SDK
  • Re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
  • Kubeflow Pipelines is available as a core component of Kubeflow or as a standalone installation

Experiment tracking and Meta ML

1. Neptune

Neptune is a lightweight experiment management and collaboration tool. It is very flexible, works with many other frameworks, and thanks to its stable user interface, it enables great scalability (to millions of runs).

It’s robust software that can store, retrieve, and analyze large amounts of data. Neptune has all the tools for efficient team collaboration and project supervision.

Neptune – summary:

  • Provides user and organization management with different organizations, projects, and user roles
  • Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
  • You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
  • Your team can track experiments which are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster)
  • Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)

2. MLflow

MLflow is an open-source platform that helps manage the whole machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

MLflow is suitable for individuals and for teams of any size.

The tool is library-agnostic. You can use it with any machine learning library and in any programming language.

MLflow comprises four main components:

  • MLflow Tracking – an API and UI for logging parameters, code versions, metrics, and artifacts when running machine learning code and for later visualizing and comparing the results
  • MLflow Projects – packaging ML code in a reusable, reproducible form to share with other data scientists or transfer to production
  • MLflow Models – managing and deploying models from different ML libraries to a variety of model serving and inference platforms
  • MLflow Model Registry – a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations
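
To illustrate the idea behind experiment tracking (logging parameters and metrics per run, then comparing runs later), here is a small standard-library sketch. It is not the MLflow API: the `Run` class and the `best_run` helper are hypothetical stand-ins, and the accuracy numbers are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its parameters and the metrics logged during it."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Keep every logged value so the metric's history can be inspected later.
        self.metrics.setdefault(key, []).append(value)


def best_run(runs, metric):
    """Return the run whose last logged value of `metric` is highest."""
    return max(runs, key=lambda r: r.metrics[metric][-1])


# Two illustrative runs with made-up accuracy values per epoch.
runs = []
for lr, accuracies in [(0.1, [0.71, 0.78]), (0.01, [0.65, 0.83])]:
    run = Run(name=f"lr={lr}")
    run.log_param("learning_rate", lr)
    for acc in accuracies:
        run.log_metric("accuracy", acc)
    runs.append(run)

winner = best_run(runs, "accuracy")
print(winner.name, winner.params, winner.metrics["accuracy"][-1])
```

A real tracking tool persists this information and adds a UI on top, but the data model of runs, parameters, and metric histories is roughly what gets compared.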

👉 Check out the comparison between MLflow & Neptune!

>> Also, see our integration with MLflow

Training Run Orchestration

1. Amazon SageMaker

Amazon SageMaker is a platform that enables data scientists to build, train, and deploy machine learning models. It integrates tools for the entire machine learning workflow, providing all of the components used for machine learning in a single toolset.

SageMaker is suitable for organizing, training, deploying, and managing machine learning models. It has a single, web-based visual interface to perform all ML development steps – notebooks, experiment management, automatic model creation, debugging, and model drift detection.

Amazon SageMaker – summary:

  • Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance – it helps to deploy the best performing model
  • SageMaker Ground Truth helps you build and manage highly accurate training datasets quickly
  • SageMaker Experiments helps to organize and track iterations of machine learning models by automatically capturing the input parameters, configurations, and results, and storing them as ‘experiments’
  • SageMaker Debugger automatically captures real-time metrics during training (such as training and validation, confusion matrices, and learning gradients) to help improve model accuracy. The Debugger can also generate warnings and remediation advice when common training problems are detected
  • SageMaker Model Monitor automatically detects concept drift in deployed models and gives detailed alerts that help developers identify and troubleshoot the source of the problem
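
As a rough illustration of what a drift alert boils down to, the sketch below compares a model's accuracy on recent traffic with its accuracy at deployment time. This is not how SageMaker Model Monitor is implemented: the `detect_drift` function, the `max_drop` threshold, and the sample data are all assumptions made for the example.

```python
from statistics import mean


def detect_drift(baseline_correct, recent_correct, max_drop=0.05):
    """Flag drift when recent accuracy falls more than `max_drop` below baseline.

    Both arguments are sequences of 0/1 flags (1 = the prediction was correct).
    """
    baseline_acc = mean(baseline_correct)
    recent_acc = mean(recent_correct)
    drop = baseline_acc - recent_acc
    return {
        "baseline": baseline_acc,
        "recent": recent_acc,
        "drifted": drop > max_drop,
    }


# Made-up data: 9 of 10 correct at deployment time, 6 of 10 on newer traffic.
baseline = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
recent = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

report = detect_drift(baseline, recent)
print(report)
```

In practice, ground-truth labels often arrive late, so monitoring tools typically also watch the input data distribution rather than accuracy alone.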

>> See our integration with Amazon SageMaker

2. Polyaxon

Polyaxon is a platform for reproducing and managing the whole life cycle of machine learning projects as well as deep learning applications.

The tool can be deployed into any data center or cloud provider, or it can be hosted and managed by Polyaxon. It supports all the major deep learning frameworks, e.g., Torch, TensorFlow, MXNet.

When it comes to orchestration, Polyaxon lets you maximize the usage of your cluster by scheduling jobs and experiments via their CLI, dashboard, SDKs, or REST API.

Polyaxon – summary:

  • Supports the entire lifecycle including run orchestration but can do way more than that
  • Has an open-source version that you can use right away but also provides options for enterprise
  • Very well-documented platform, with technical reference docs, getting started guides, learning resources, tutorials, changelogs, and more
  • Monitor, track, and analyze every single optimization experiment with the experiment insights dashboard

👉 Check out the comparison between Polyaxon & Neptune

Wrapping it up

Finding the right alternative to Pachyderm may not be easy. Every tool is great and offers helpful functionality. But once you know exactly what you’re looking for, you’ll find the perfect fit. Try them out, mix, and experiment. After all, that’s what machine learning is all about.

Happy experimenting!