
Scaling Machine Learning Experiments With neptune.ai and Kubernetes

9 min
16th May, 2024

TL;DR

Scaling machine learning (ML) experiments is a challenging process that requires efficient resource management, experiment tracking, and infrastructure scalability.

neptune.ai offers a centralized platform to manage ML experiments, track real-time model performance, and store metadata.

Kubernetes automates container orchestration, improves resource utilization, and enables horizontal and vertical scalability.

Combining neptune.ai and Kubernetes provides a robust solution for scaling ML experiments, making it easier to manage and scale experiments across multiple environments and team members.

Scaling machine-learning experiments efficiently is a challenge for ML teams. The complexity lies in managing configurations, launching experiment runs, tracking their outcomes, and optimizing resource allocation.

This is where experiment trackers and orchestration platforms come in. Together, they enable efficient large-scale experimentation. Neptune and Kubernetes are a prime example of this synergy.

In this tutorial, we’ll cover:

  • The main challenges of scaling machine learning experiments
  • How neptune.ai and Kubernetes address these challenges
  • A step-by-step example of scaling ML model training with neptune.ai and Kubernetes

Scalability challenges in training machine learning models

Scaling ML model training comes with several challenges that organizations and researchers must navigate to efficiently leverage their computational resources and manage their ML models effectively. These challenges stem from both the complexity of scaling ML models and workflows and the limitations of the underlying infrastructure. The main challenges in scaling ML algorithms and training experiments are the following:

  1. Experiment tracking and management: As the number of experiments grows, tracking each experiment’s parameters, code versions, datasets, and outcomes becomes increasingly complex. Without a robust tracking system, it’s easy to lose track of experiments, leading to duplicated efforts or overlooked optimizations.
  2. Reproducibility: Ensuring that experiments are reproducible and that models perform consistently across different environments and datasets is crucial for the validity of ML experiments.
  3. Experimentation velocity: Speeding up the iteration cycle of experimenting with different models, parameters, and data preprocessing techniques is crucial for the rapid development of ML applications. Scaling up the number of experiments without losing velocity requires sophisticated automation and orchestration tools.
  4. Resource management: It can be challenging to efficiently allocate computational resources among multiple experiments and ensure that these resources are optimally used. Overallocation can lead to wasteful spending, while underallocation can result in slow iteration processes.
  5. Infrastructure elasticity and scalability: The underlying infrastructure must be able to scale up or down based on the demand of ML workloads. This elasticity is crucial for handling variable workloads efficiently but can be challenging to implement and manage.

Using neptune.ai and Kubernetes as solutions for scaling ML experiments

Now that we have identified the main challenges of scaling ML experiments, we will explore how combining neptune.ai and Kubernetes can offer a powerful solution to scale ML experiments efficiently.

Neptune and its role in scalability

Neptune supports teams aiming for horizontal and vertical scaling by helping them manage and optimize machine learning experiments. It helps them track, visualize, and organize their ML projects, allowing them to better understand model performance, identify areas for improvement, and streamline their workflows at scale.

Vertical scaling with Neptune


Vertical scaling means increasing the computational power of existing systems. This involves adding CPU and GPU cores, memory, and storage capacity to accommodate more complex algorithms, larger datasets, or both.

Neptune’s role in vertical scaling:

  • Efficient resource management: Neptune automatically logs system metrics to help track and optimize the use of computational resources.
  • Performance monitoring: Neptune offers real-time monitoring of model performance, helping to ensure that systems remain efficient and effective and enabling the early abort of unpromising experiments to free up computational resources (see the sketch below).
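
As an illustration, here is a minimal sketch (not part of the tutorial’s code) of a run that captures hardware metrics and aborts early when it looks unpromising. The metric names and the early-abort rule are hypothetical:

import random

import neptune

# Minimal sketch: Neptune captures hardware metrics automatically, and the
# metrics you log can be used to spot and abort unpromising runs early.
# Assumes NEPTUNE_PROJECT and NEPTUNE_API_TOKEN are set in the environment.
run = neptune.init_run(
    capture_hardware_metrics=True,  # CPU, GPU, and memory utilization
    capture_stdout=True,
    capture_stderr=True,
)

for epoch in range(10):
    val_accuracy = random.random()  # stand-in for a real validation metric
    run["metrics/val_accuracy"].append(val_accuracy)
    if epoch >= 3 and val_accuracy < 0.2:  # hypothetical early-abort rule
        break  # free the machine for more promising experiments

run.stop()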

Horizontal scaling with Neptune

Horizontal scaling involves adding more compute instances to handle an increased workload, such as more users, more data, or an increased number of experiments. The system’s capacity grows laterally—adding more processing units rather than making existing units more powerful.

Neptune’s role in horizontal scaling:

  • Distributed systems management: Neptune excels at managing experiments across multiple machines, facilitating seamless integration and synchronization across distributed computing resources.
  • Scalability of data logging: As the scale of operations grows, so does the volume of data from experiments. Neptune handles large volumes of data logs efficiently, maintaining performance without bottlenecks. It also enables users to asynchronously synchronize data logged locally with the server without interrupting other tasks (see the sketch after this list).
  • Collaboration and integration: Neptune’s seamless integrations with other MLOps tools and cloud services ensure that as the number of experiments and people involved increases, all team members can maintain a current, unified view of the ML lifecycle.
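
For example, if runs were logged locally (say, in offline mode or because the connection dropped), the data can later be pushed to the server with Neptune’s command-line interface. A minimal sketch, assuming the neptune package (which ships the CLI) is installed:

# 1. Log locally, for instance when the tracking server is unreachable:
#    run = neptune.init_run(mode="offline")
# 2. Later, upload everything stored in the local .neptune folder:
neptune status   # list runs that have not been synchronized yet
neptune sync     # upload locally stored run data to the Neptune servers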


Kubernetes and its role in scalability

Before we dive into the details of how Kubernetes contributes to scalability in machine learning, let’s take a step back and quickly recap some Kubernetes fundamentals.

Kubernetes is a system for managing containerized applications based on several core concepts. The arguably most important component is the “cluster,” a set of “nodes” (compute instances) on which the application containers run. A Kubernetes cluster consists of one or multiple nodes.

Each node runs a “kubelet,” an agent that launches and monitors “pods” (the basic scheduling unit in Kubernetes) and communicates with the “control plane”.

The cluster’s control plane makes global decisions and responds to cluster events. The “scheduler,” part of the control plane, is responsible for assigning applications to nodes based on resource availability and policies.


The different aspects of scaling in Kubernetes

There are several different ways in which Kubernetes can scale applications and the cluster.

The HorizontalPodAutoscaler spins up or down pod replicas to appropriately handle incoming requests. This is particularly relevant for machine learning inference: If there are a lot of prediction requests, more model server instances can be added automatically to handle the load.
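
As an illustration, a minimal HorizontalPodAutoscaler manifest could look like the sketch below; the model-server Deployment it targets is hypothetical:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server        # assumed name of an inference Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas when average CPU load exceeds 70%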

In the context of training machine learning models, vertical scaling and autoscaling of the cluster itself are typically more relevant.

Vertical pod autoscaling adjusts the resources available to pods to accommodate the needs of more intensive computational tasks without increasing the number of containers running. This is particularly useful when dealing with computationally hungry ML workloads.
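
For example, a VerticalPodAutoscaler resource could look like the sketch below. Note that the VPA is an add-on that must be installed in the cluster, and the targeted training Deployment is hypothetical:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: trainer-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trainer             # assumed name of a training workload
  updatePolicy:
    updateMode: "Auto"        # let the VPA apply its recommendations automatically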

Additionally, cluster autoscaling dynamically adjusts the number and type of nodes in a cluster based on the workload’s requirements. If the aggregate demand from all pods exceeds the current capacity of the cluster, new nodes are automatically added. Similarly, surplus nodes are removed when they are no longer needed.

This level of dynamic resource management is critical in maintaining cost efficiency and ensuring that ML experiments can run with the required computational resources without manual intervention.

Efficient resource utilization

Kubernetes optimizes resource utilization through advanced scheduling algorithms. Based on the resource requests and limits specified for each pod, it places containers on nodes that meet the specific computational requirements.

GPU scheduling: Kubernetes offers support for scheduling GPUs through the use of node labels and resource requests. For ML experiments requiring GPU resources for training deep learning models, pods can be configured with specific resource requests, including nvidia.com/gpu for NVIDIA GPUs, ensuring that these pods are scheduled on nodes equipped with the appropriate GPU resources. Under the hood, this is managed through Kubernetes’ device plugins, which extend the kubelet to enable additional resource types.
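
For instance, a pod that needs a single NVIDIA GPU could request it as in the sketch below; the image name is a placeholder, and the NVIDIA device plugin must be deployed on the cluster’s GPU nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1             # extended resource exposed by the device plugin
  restartPolicy: Never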

Storage optimization: Kubernetes manages storage via Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) for ML tasks requiring significant data storage. These resources decouple physical storage from logical volumes, allowing dynamic storage provisioning based on workload demands.
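
A minimal PersistentVolumeClaim for training data could look like this; the requested size and storage class are examples and depend on your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  # storageClassName: standard    # adjust to a storage class available in your cluster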

Node affinity/anti-affinity and taints/tolerations: These features allow more granular control over pod placement. Node affinity rules can direct Kubernetes to schedule pods on nodes with specific labels (e.g., those indicating the presence of high-performance GPUs). Conversely, taints and tolerations prevent or permit pods from being scheduled on nodes based on specific criteria, effectively isolating and protecting critical ML workloads.
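
As a sketch, the pod below combines a node affinity rule (schedule only on nodes labeled with a specific GPU type) with a toleration for a taint that reserves those nodes for ML workloads. The label, taint, and image names are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-a100"]   # illustrative node label
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workloads"             # illustrative taint reserved for ML jobs
    effect: "NoSchedule"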

Environment consistency and reproducibility

Kubernetes ensures consistency across development, testing, and production environments, addressing the reproducibility challenge in ML experiments. By using containers, developers package their applications with all of their dependencies, which means the application runs the same regardless of where Kubernetes deploys it.

High availability and fault tolerance

Kubernetes enhances application availability and fault tolerance. It can detect and replace failed pods, redistribute workloads to available nodes, and ensure the system is resilient to failures. This capability is critical for maintaining the availability of ML services, especially in production environments.

Leveraging distributed training

Kubernetes’ architecture naturally supports distributed computation, allowing for parallel processing of a large dataset across multiple nodes by leveraging StatefulSets, for example. This means training complex distributed ML models (e.g., large-scale ML models like LLMs) can be significantly sped up.

Furthermore, Kubernetes’ ability to dynamically allocate resources ensures that each part of the distributed training process receives the necessary computational power without manual intervention. Workflow orchestrators like Apache Airflow, Prefect, Argo, and Metaflow can manage and coordinate these distributed tasks, providing a higher-level interface for executing and monitoring complex ML pipelines.

By leveraging these tools, ML models and workloads can be split into smaller, parallelized tasks that run simultaneously across multiple nodes. This setup reduces training time, accelerates data processing, and simplifies the management of larger datasets, resulting in more efficient distributed ML training.

Comparing Neptune and Kubernetes roles in scaling ML experiments
Neptune’s and Kubernetes’ roles in scaling machine-learning experiments.

When is Kubernetes not the right choice?

Kubernetes is a complex system tailored for orchestrating containerized applications across a cluster of multiple nodes. It is overkill for small, simple projects as it involves a steep learning curve and significant overhead for setup and maintenance. If you don’t need to handle high traffic, require automated scaling, or run distributed applications across multiple compute instances, simpler deployment methods or platforms can achieve the desired results with much less complexity and resource investment.

How to scale ML model training with neptune.ai and Kubernetes step-by-step

We will demonstrate how to scale machine learning model training using Neptune and Kubernetes with a step-by-step example.

In our example, the aim is to accurately classify the headlines of the tldr_news dataset. The tldr_news dataset consists of various tech news articles, each containing a headline, the content of the article, and a category in which the article falls (five categories in total). We will select several pre-trained models available on the HuggingFace Hub, fine-tune them on the tldr_news dataset, and compare their performance.

The full code for the tutorial is available on GitHub. To follow along, you’ll need to have Docker, Python 3.10, and Poetry installed on your machine and access to a Kubernetes (or minikube) cluster. We’ll set up everything else together. (If this is your first time interacting with the HuggingFace ecosystem, we encourage you to first go through their text classification tutorial before diving into our Neptune and Kubernetes tutorial.)

Step 1: Project initialization

We will start by creating our project using Poetry and adding all the necessary dependencies to run it:

poetry new neptune-k8
cd neptune-k8
poetry run python --version  # make sure this is 3.10.x

Now you can install the required dependencies:

poetry add "transformers[torch]@^4.39.1" datasets@^2.18.0 hydra-core@^1.3.2 evaluate@^0.4.1 scikit-learn@^1.4.1.post1 neptune@^1.10.1

Step 2: Set up data preparation

Before we can start training and comparing models, we first need to prepare a dataset. Since the tokenization depends on the model we’re using, we’ll write a class that takes the name of the pre-trained model as a parameter and selects the correct tokenizer. The dataset processing will be executed at the beginning of each training run, ensuring that the data matches the model.

So, let’s create a file called data.py and implement this class:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding


class TldrClassificationDataset:
    """Class to prepare a dataset for text classification with a pretrained model."""

    def __init__(self, pretrained_model: str):
        """Initializes the dataset handler with a tokenizer and data collator based on a pretrained model.

        Args:
            pretrained_model: A string representing the pretrained model to use for tokenization.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
        self.data_collator = DataCollatorWithPadding(
            tokenizer=self.tokenizer,
            padding="longest"
        )

    def preprocess(self, examples):
        """Tokenizes the text examples for classification.

        Args:
            examples: A dictionary containing the text examples with a 'headline' key.

        Returns:
            Tokenized outputs with added truncation and length constraints.
        """
        return self.tokenizer(
            examples["headline"],
            truncation=True,
            max_length=64
        )

    def prepare_dataset(self):
        """Loads, cleans and preprocesses the text classification dataset.

        Returns:
            A processed dataset ready for model training or evaluation with labels and tokenized inputs.
        """
        dataset = load_dataset("JulesBelveze/tldr_news")
        dataset = dataset.remove_columns("content")
        dataset = dataset.rename_columns({"category": "labels"})
        featurized_dataset = dataset.map(self.preprocess, batched=True)
        return featurized_dataset

Calling the prepare_dataset method of this class returns an instance of datasets.DatasetDict ready for direct use in training or evaluating a text classification model, with each text input properly tokenized and all input features uniformly structured.

Note that the returned DatasetDict contains both a training and a test split, which you can access through featurized_dataset["train"] and featurized_dataset["test"]. The dataset is already split into a train and test set when we download it, which ensures that the model performance is evaluated on the same data every time.
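
If you want to sanity-check the data pipeline locally before wiring it into the training script, a quick sketch (assuming the package layout created in Step 1) could look like this:

from neptune_k8.data import TldrClassificationDataset

dataset_wrapper = TldrClassificationDataset(pretrained_model="bert-base-uncased")
featurized_dataset = dataset_wrapper.prepare_dataset()

print(featurized_dataset)                     # DatasetDict with "train" and "test" splits
print(featurized_dataset["train"][0].keys())  # headline, labels, input_ids, attention_mask, ...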

Step 3: Set up the training procedure

Now that we have declared the necessary steps to prepare the dataset for training, we need to define the model and the training procedure.

Define the training pipeline using Hydra

To do so, we will leverage the power of Hydra to define and configure our experiments. Hydra is a Python framework developed by Facebook Research that simplifies configuration management in applications. It uses YAML files for dynamic configuration through a hierarchical composition system. Key features include command-line overrides, support for multiple environments, and easy launching of varied configurations. It is especially useful for machine learning experiments where complex, changeable configurations are common.

We chose to use Hydra as it allows us to define the entire training pipeline within a single YAML file. Here is what our complete config.yaml file looks like:

pretrained_model_name:

dataset:
  _target_: neptune_k8.data.TldrClassificationDataset
  pretrained_model: ${pretrained_model_name}

model:
  _target_: transformers.models.auto.AutoModelForSequenceClassification.from_pretrained
  pretrained_model_name_or_path: ${pretrained_model_name}
  num_labels: 5

neptune_callback:
  _target_: transformers.integrations.integration_utils.NeptuneCallback
  name: ${oc.env:NEPTUNE_PROJECT}
  project: ${oc.env:NEPTUNE_USER}/${oc.env:NEPTUNE_PROJECT}
  tags: ["tldr-news", "neptune-k8s-tutorial"]

Let’s go through this file step by step:

  • The model to use will dynamically be defined through the pretrained_model_name parameter when we launch a training run.
  • In Hydra configurations, the _target_ key specifies the fully qualified Python class or function that should be instantiated or called during a process step. Any other keys in a block (such as num_labels in the model block) are passed as keyword arguments.

    Using this key, in the dataset block we link to the TldrClassificationDataset class we created in the previous step. In the model block, we define that we’ll instantiate the AutoModelForSequenceClassification object using the from_pretrained class method (see the short sketch after this list for how Hydra resolves such a block).
  • We leverage Neptune’s transformers integration to track our experiment by logging different training data metadata, the experiment’s configuration, and the model’s performance metrics. Neptune’s development team contributed and maintains this integration.

    At this point, we don’t have to specify a project or an API key yet. Instead, we add environment variables that we’ll populate later.
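
To make the _target_ mechanism concrete, here is a short sketch (not part of the tutorial code, run from within the project so the neptune_k8 package is importable) of how Hydra turns such a block into a Python object:

from hydra.utils import instantiate
from omegaconf import OmegaConf

# A config block equivalent to the "dataset" section of config.yaml,
# with the interpolation already resolved:
cfg = OmegaConf.create(
    {
        "_target_": "neptune_k8.data.TldrClassificationDataset",
        "pretrained_model": "bert-base-uncased",
    }
)

dataset_wrapper = instantiate(cfg)  # same as TldrClassificationDataset("bert-base-uncased")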

Create a training and evaluation script

Now that we’ve defined our training pipeline, we can create a training and evaluation script that we’ll call main.py:

import evaluate
import hydra
import numpy as np
from hydra.utils import instantiate
from transformers import TrainingArguments, Trainer

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


@hydra.main(config_path=".", config_name="config")
def run(args):
    dataset_wrapper = instantiate(args.dataset)
    featurized_dataset = dataset_wrapper.prepare_dataset()
    model = instantiate(args.model)

    trainer = Trainer(
        model=model,
        train_dataset=featurized_dataset["train"],
        eval_dataset=featurized_dataset["test"],
        tokenizer=dataset_wrapper.tokenizer,
        data_collator=dataset_wrapper.data_collator,
        compute_metrics=compute_metrics,
        callbacks=[instantiate(args.neptune_callback)],
    )

    trainer.train()


if __name__ == "__main__":
    run()

Note that we define the location of the Hydra configuration file through the @hydra.main decorator that wraps the run function. The decorator injects an args object that allows us to access the configuration parameters of the particular training run.

Configure Neptune

The last thing we need to do before starting our first training run is to configure Neptune.

If you don’t have an account yet, first head over to neptune.ai/register to sign up for a free personal account.

Once you’re logged in, create a new project “neptune-k8” by following the steps outlined here. As the project key, I suggest you choose “KUBE”.

After you’ve created the project, get your API token and set the environment variables we referenced in our Hydra configuration file:

export NEPTUNE_PROJECT=neptune-k8
export NEPTUNE_USER=<YOUR_USERNAME>
export NEPTUNE_API_TOKEN=<YOUR_API_TOKEN> 

Manually launch a training run

Finally, we can manually launch a training run using the following command:

poetry run python -m neptune_k8.main +pretrained_model_name=bert-base-uncased

If you now head to your Neptune project page and click on the latest experiment, you can watch the training process. You should see logs indicating that the model’s weights have been downloaded and that the training process is running, similar to what’s shown in this screenshot:

Example training process

Step 4: Dockerize the experiment

To scale our machine learning experiment by running it on a Kubernetes cluster, we need to integrate Docker into our workflow. Containerization through Docker ensures environment consistency, reproducibility, portability, isolation, and ease of deployment.

Let’s create a Dockerfile that prescribes how to install all required dependencies and packages our code and configuration:

FROM python:3.10-slim

ENV POETRY_HOME='/usr/local'
ENV PATH="$POETRY_HOME/bin:$PATH"

RUN apt-get update \
    && apt-get install --no-install-recommends -y \
      curl build-essential

RUN curl -sSL https://install.python-poetry.org | python3 -

ENV APP_HOME /app
WORKDIR $APP_HOME

COPY poetry.lock pyproject.toml ./
RUN poetry install --only=main

COPY . $APP_HOME/

ENTRYPOINT ["poetry", "run", "python3", "-m", "neptune_k8.main"]

Then, we create the neptune-k8 Docker image by running:

docker build -t neptune-k8:latest -f Dockerfile .

To use this image on a Kubernetes cluster, you’ll have to make it available in an image registry.

If you’re working with minikube, you can use the following command to make the image available to your minikube cluster:

eval $(minikube docker-env)

For details and other options, see the minikube handbook.

If you’re working with a Kubernetes cluster set up differently, you’ll have to push to an image registry from which the cluster’s nodes can pull.
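
For example, with a placeholder registry, tagging and pushing the image could look like this (remember to update the image reference in the Job manifest accordingly):

docker tag neptune-k8:latest <YOUR_REGISTRY>/neptune-k8:latest
docker push <YOUR_REGISTRY>/neptune-k8:latest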

Step 5: Launching Kubernetes training jobs

We now have a fully defined training process and a Docker image containing our training code. With this, we’re ready to run it with different pre-trained models in parallel to determine which performs best.

Specifically, we’ll execute each experiment run as a Kubernetes Job. The job launches a Pod with our training container and waits until training completes. It will be up to the cluster to find and provide the resources required by the Pod. If the cluster does not have a sufficient number of nodes to run all requested jobs simultaneously, it will either add additional nodes (cluster autoscaling) or queue jobs until resources are freed up.

Here’s the deploy.sh Bash script for creating the Job manifests and submitting them to the Kubernetes cluster:

#!/bin/bash
MODELS=("distilbert/distilbert-base-uncased" "roberta-base" "bert-base-uncased" "albert/albert-base-v2")

for MODEL in "${MODELS[@]}"
do
  # Derive a unique Job name from the model identifier
  # (Kubernetes object names may not contain "/" or end with "-")
  JOB_NAME="neptune-k8-job-$(echo "$MODEL" | tr '/.' '--')"
  JOB_YAML=$(cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${JOB_NAME}
spec:
  template:
    spec:
      containers:
      - name: neptune-k8
        image: neptune-k8:latest
        args: ["+pretrained_model_name=$MODEL"]
        env:
        - name: NEPTUNE_USER
          value: "${NEPTUNE_USER}"
        - name: NEPTUNE_PROJECT
          value: "${NEPTUNE_PROJECT}"
        - name: NEPTUNE_API_TOKEN
          value: "${NEPTUNE_API_TOKEN}"        resources:          requests:            cpu: "1"            memory: "4Gi"          limits:            cpu: "2"            memory: "8Gi"
      restartPolicy: Never
  backoffLimit: 4
EOF
)

  echo "$JOB_YAML" | kubectl apply -f -
done

For the sake of our example, we only try four models, but we can scale it up to hundreds of models by adding more names to the MODELS list.

Note that you’ll have to set the NEPTUNE_USER, NEPTUNE_PROJECT, and NEPTUNE_API_TOKEN environment variables in the terminal session you’re running the script from.

You also have to make sure that kubectl has access to your cluster. To inspect the currently configured context, run

kubectl config current-context

With Neptune and Kubernetes access in place, you can execute the shell script and launch the training job:

./deploy.sh

Step 6: Model performance assessment

With the training jobs launched, we head over to app.neptune.ai. There, we select our project, filter our experiments by the tag “neptune-k8s-tutorial”, tick the runs we want to compare, and click “Compare runs”.

In our case, we want to compare the accuracy of the four models throughout the training epochs to identify the most accurate model. Inspecting the accuracy curves in the graph below, we see that the purple experiment, corresponding to albert/albert-base-v2, achieves the best accuracy.

Comparison of models’ accuracy. This graph tracks the accuracy of four machine learning models over training steps. Model EX-936 (corresponding to albert/albert-base-v2) starts with lower performance but eventually surpasses the others, indicating the best overall improvement in accuracy. Each line represents a different model, allowing for a direct comparison of their learning trajectories.

Tips & Tricks

  1. Specify Job resource requirements
    Specifying resource requests and limits in a Kubernetes Job is crucial for ensuring the job gets the resources it needs to run while preventing it from consuming all the resources on a node. Correctly defined requests and limits enable better scheduling decisions and thus more efficient utilization of cluster resources. While requests ensure a job can run optimally, limits are essential for the cluster’s overall stability and performance.
  2. Use the nodeSelector
    Using nodeSelector is good practice when running ML experiments with different resource requirements. It allows you to specify which nodes should run your ML experiments, ensuring they are executed on nodes with the necessary hardware resources (like GPUs) for efficient training.

For example, to run our training pods only on nodes with the label pool: gpu-nodepool, we would modify the Job manifest as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: ${JOB_NAME}
spec:
  template:
    spec:
      containers:
      - name: neptune-k8
        image: neptune-k8:latest
        args: ["+pretrained_model_name=$MODEL"]
        env:
        - name: NEPTUNE_USER
          value: "${NEPTUNE_USER}"
        - name: NEPTUNE_PROJECT
          value: "${NEPTUNE_PROJECT}"
        - name: NEPTUNE_API_TOKEN
          value: "${NEPTUNE_API_TOKEN}"
      restartPolicy: Never
      nodeSelector:
        pool: gpu-nodepool
  backoffLimit: 4
  3. Use Neptune’s tags and filtering system
    When multiple collaborators each run a large number of experiments, the runs table can quickly become hard to navigate. Consistently tagging experiments makes it easy to isolate groups of related runs and filter for them later.

Conclusion

The combination of Neptune and Kubernetes is an excellent solution to the challenges teams face when scaling ML experimentation and model training. Neptune offers a centralized platform for experiment management, metadata tracking, and collaboration. Kubernetes provides the infrastructure to handle variable compute workloads and training jobs efficiently.

Beyond solving the scalability and management of ML experiments, Neptune and Kubernetes pave the way for efficient and robust ML model development and deployment. They allow teams to focus on innovation and achieving their objectives rather than being held back by the complexities of infrastructure management and experiment tracking.
