
How to Keep Track of TensorFlow/Keras Model Development with Neptune

10th August, 2023

The model development lifecycle starts with data exploration, then we choose features for our model, choose a baseline algorithm, and next, we try to improve baseline performance with different algorithms and parameter tuning.

Sounds simple enough. But, during all of this, you’ll probably create multiple notebooks, or modify one notebook over and over again. This is a great way to lose track of your research.

Luckily, it’s avoidable. You just need to log all your models, data, and features. This way, whenever you want to revisit something you were working on before, it’s very easy to do.

In this article, I’ll show you how to organize and track TensorFlow projects and keep everything nice and tidy.

It all comes down to MLOps, which is a set of principles and tools that brings together product teams and data science teams, and all the crucial operations of development, deployment, monitoring, management, and securing ML models in production. 

MLOps is basically DevOps, but for Machine Learning. The ML lifecycle should support model delivery at speed and scale in order to handle the velocity and volume of data in your organization. Why do you need this? Because it’s very difficult to get ML applications from the idea phase to actual deployment in production.

Challenges in ML model lifecycle

To stay on track with your experiments, it’s necessary to track code, data, model versions, hyperparameters, and metrics. Organizing them in a meaningful way will help you collaborate within your organization.

In real-world projects, data changes all the time. New tables are added, mislabeled points are removed, feature engineering techniques change, validation and testing data sets change to reflect the production environment. When data changes, everything based on that data changes too, but the code remains the same so it’s important to keep track of your data versions. 

Ways to keep track of your ML experiments 

Proper experiment tracking makes it easy to compare metrics and parameters based on data versions, compare experiments, and compare best or worst predictions on test or validation sets. You can also analyze hardware consumption for model training. Look at the prediction explanations and feature importance from tools such as LIME.

The steps below will help you track your experiments and obtain charts like the one shown above.

Specify project requirements

First, set a metric for your project (the threshold for performance). For example, optimize your model for the F1-score.
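As a reminder of what that metric measures, here is a minimal pure-Python sketch of the F1-score for a single class. The function name `f1_score` and the toy labels are made up for illustration; in a real project you would more likely use a library implementation.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# toy labels, purely illustrative
y_true = ["sport", "tech", "sport", "sport", "tech"]
y_pred = ["sport", "sport", "sport", "tech", "tech"]
score = f1_score(y_true, y_pred, positive="sport")
print(round(score, 3))
```

Because F1 balances precision and recall, it is usually a safer target than plain accuracy when the classes are imbalanced.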

The first deployment should involve building a simple model with a focus on building a proper ML pipeline for prediction. This will help you deliver value quickly and avoid the trap of spending too much time trying to build the perfect model. When you start a new ML project at your organization, experiment runs can quickly scale to tens, hundreds, even thousands. Your workflow will get muddy if you don’t track it.

So, tracking tools like Neptune are becoming a standard tool in ML projects. You can use it to log your data, model, hyperparameters, confusion matrix, graphs, and much more. Including a tool like Neptune in your workflow/code is very simple compared to the pain you experience when you don’t track anything.

To show you how to approach tracking, we’re going to train a text classification model using TensorFlow. We’ll train the model using LSTMs:

  • Long Short Term Memory networks are a special kind of RNN, capable of handling long-term dependencies.
  • LSTMs are specially designed to take care of the long-term dependency problem (remembering information for a longer time).
  • All RNNs have repeating modules of neural nets in the form of a chain.

The figure below represents repeating modules in an LSTM.

Don’t worry about what happens on the inside (if you’re craving to learn, check this article for in-depth insights about LSTMs).

Enough introduction, let’s implement and track model development using Neptune.

Before we do any modeling or analysis, let’s set up a well-organized ML codebase.

Avoid mess in your model development process with Neptune

Install dependencies for this project

We’ll be using Neptune in Jupyter notebooks, so we need both the Neptune client and the Neptune Jupyter extension. The extension lets us save notebook checkpoints to Neptune. Run the commands below to install and enable it.

!pip install neptune numpy~=1.19.2 tensorflow nltk
!pip install -U neptune-notebooks
!jupyter nbextension enable --py neptune-notebooks

After running the above commands, you’ll see the below extensions in your jupyter notebook.

Now that we’ve installed the necessary dependencies, let’s import them.

import tensorflow as tfl
import numpy as np
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from neptune.integrations.tensorflow_keras import NeptuneCallback
import neptune
STOPWORDS = set(stopwords.words('english'))

Connect your project to the Neptune client. If you’re new to the platform, read the guide to get started.

run = neptune.init_run(project='aravindcr/Tensorflow-Text-Classification',
                       api_token='YOUR TOKEN')  # your credentials

These are some of the parameters I used for this project, and I logged them here. To follow along, use the notebook attached here in the app. To know more about logging your metadata, check this guide.

Save the hyperparameters (for each iteration)

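# log metadata from dictionary
params = {'embed_dims': 64,
          'vocab_size': 5000,
          'max_len': 200,
          'padding_type': 'post',
          'trunc_type': 'post',
          'oov_tok': '<OOV>',
          'training_portion': 0.8}
run['parameters'] = params

# the snippets below use these values as plain variables
embed_dims, vocab_size, max_len = params['embed_dims'], params['vocab_size'], params['max_len']
padding_type, trunc_type = params['padding_type'], params['trunc_type']
oov_tok, training_portion = params['oov_tok'], params['training_portion']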

Version your dataset

The dataset we’re using is BBC news article data for classification. Download the data from here. You can also log your data to Neptune with the below command. 

This will help us track different versions of the dataset when performing experiments. This can be accomplished with Neptune’s set_property function and hashlib module in Python.

# to log single file object
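The hashing half of that idea needs only the standard library. The sketch below computes a SHA-256 fingerprint of a file; the helper name `file_sha256` is hypothetical, and how you store the resulting string in your run is left open here.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read fixed-size chunks so large datasets do not have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the dataset in this article you would call `file_sha256('news-docs-bbc.csv')`. If the file changes by even one byte, the digest changes, which makes it a convenient dataset version identifier to log next to your parameters.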

In the section below, I’ve created two lists, labels and texts, which store the label of each news article and the actual text associated with it. We’re also removing the stopwords using nltk.

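labels = []
texts = []

with open('news-docs-bbc.csv', 'r') as file:
    data = csv.reader(file, delimiter=',')
    # if your copy of the CSV starts with a header row, skip it with next(data)
    for row in data:
        labels.append(row[0])  # assumed layout: label in the first column
        text = row[1]          # article text in the second column
        for word in STOPWORDS:
            token = ' ' + word + ' '
            text = text.replace(token, ' ')
            text = text.replace('  ', ' ')  # collapse double spaces left behind
        texts.append(text)

train_size = int(len(texts) * training_portion)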

Let’s split the data into training and validation sets. If you look at the above parameters, we’re using 80% for training and 20% for validating the model we’ve built for this use case.

train_text = texts[0: train_size]
train_labels = labels[0: train_size]

validation_text = texts[train_size:]
validation_labels = labels[train_size:]

Let’s convert the sentences into word tokens. The tokenizer keeps the five thousand most common words (vocab_size), and oov_token is the placeholder used whenever we encounter a word that it hasn’t seen.

<OOV> will be used for words that aren’t found in word_index. fit_on_texts updates the internal vocabulary based on a list of texts, creating a vocabulary index based on word frequency.

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_text)  # build the vocabulary from the training texts only
word_index = tokenizer.word_index


As the contents of word_index show, <OOV> takes the first index (1), followed by the other words in order of frequency.
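To make the idea of a frequency-based index concrete, here is a simplified pure-Python sketch. It is not the Keras implementation, and the names `build_word_index` and `encode` are made up for illustration. It reserves index 1 for the out-of-vocabulary token and maps unknown or too-rare words to it.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integers by descending frequency; index 1 is kept for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def encode(text, index, num_words, oov_token="<OOV>"):
    """Replace each word with its index; unknown or too-rare words become OOV."""
    oov = index[oov_token]
    ids = []
    for word in text.lower().split():
        i = index.get(word)
        ids.append(i if i is not None and i < num_words else oov)
    return ids


toy_index = build_word_index(["the match ended", "the market fell", "the match"])
print(toy_index)
print(encode("the market crashed", toy_index, num_words=4))
```

In this toy corpus "the" and "match" are frequent enough to keep their own indices, while "market" (too rare for `num_words=4`) and "crashed" (never seen) both collapse to the OOV index.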

Now that we’ve created a vocabulary index based on frequency, let’s convert those tokens into lists of sequences. texts_to_sequences transforms text into a sequence of integers. In simple terms, it converts words in the text to the corresponding integer value in the word_index dictionary.

train_sequences = tokenizer.texts_to_sequences(train_text)

train_padded = pad_sequences(train_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

One thing to keep in mind when training neural nets on a downstream NLP task: the sequences need to be the same size, so we pad them using the max_len parameter. In our case, I specified 200 at the beginning, which is why we call pad_sequences above.

Articles with sequences longer than max_len are truncated to 200, and shorter ones are padded. For example, if the sequence length is 186, it will be padded to 200 with 14 zeros. Usually, we fit the tokenizer once, on the training data, and then convert sequences many times, which is why we haven’t combined the training and validation sets.
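The padding rule is easy to check by hand. The following sketch mimics 'post' padding and 'post' truncation in plain Python; `pad_post` is a hypothetical helper used only to illustrate the behaviour, not part of Keras.

```python
def pad_post(sequence, max_len, pad_value=0):
    """Cut the sequence at max_len, then append pad_value until it has max_len items."""
    trimmed = sequence[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))


short = pad_post(list(range(1, 187)), max_len=200)  # a 186-token article
long = pad_post(list(range(1, 251)), max_len=200)   # a 250-token article
print(len(short), short[-14:])
print(len(long), long[-1])
```

The 186-token sequence gains 14 trailing zeros, and the 250-token sequence loses its last 50 tokens, so both end up with exactly 200 entries.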

valdn_sequences = tokenizer.texts_to_sequences(validation_text)
valdn_padded = pad_sequences(valdn_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)


Let’s take a look at our labels. Labels need to be tokenized and all training labels are expected to be in the form of a NumPy array. We’ll convert those to a NumPy array with the below code.

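label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

# same 80/20 split as the texts; each label becomes a one-element integer sequence
training_label_seq = np.array(label_tokenizer.texts_to_sequences(labels[:train_size]))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(labels[train_size:]))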

Before starting with the modeling task, let’s see how the articles look before and after padding. Some of the words become <OOV> because they fall outside the vocab_size most frequent words set at the top.

word_index_reverse = dict([(value, key) for (key, value) in word_index.items()])

def decode_article(text):
    return ' '.join([word_index_reverse.get(i, '?') for i in text])

Train TensorFlow model

With tfl.keras.Sequential we group a linear stack of layers into a tfl.keras.Model. The first layer is an embedding layer, which stores one vector per word, so sequences of words are converted into sequences of vectors. Word embeddings are word vectors in which words with similar meanings have similar representations.

tfl.keras.layers.Bidirectional is the bidirectional wrapper for RNNs, which propagates inputs forwards and backwards through the LSTM layer and then concatenates the outputs. This is good for learning long-term dependencies. To do classification, we then feed the result into dense layers.

The activation functions we’re using here are relu and softmax. The relu function returns 0 for any negative input, and for any positive value of x it returns x itself. To know a little bit more about relu, check the guide.

The output dense layer has six units: the label tokenizer numbers the five news categories starting from 1, so index 0 goes unused. Its ‘softmax’ activation function normalizes the output of the network to a probability distribution over the predicted output classes.
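Both activation functions fit in a few lines of plain Python. This is only a sketch of the math, using the standard definitions; the input scores are arbitrary.

```python
import math


def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.5), relu(3.0))
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score receives the largest probability, and the probabilities always add up to 1, which is what lets us read the final layer as a distribution over classes.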

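model = tfl.keras.Sequential([
    tfl.keras.layers.Embedding(vocab_size, embed_dims),
    # LSTM width is not stated in the text; embed_dims units is an assumption
    tfl.keras.layers.Bidirectional(tfl.keras.layers.LSTM(embed_dims)),
    tfl.keras.layers.Dense(embed_dims, activation='relu'),
    tfl.keras.layers.Dense(6, activation='softmax')
])
model.summary()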

As you can see above in the model summary, we have an embedding layer and bidirectional LSTM. Output from bidirectional is double what we put in LSTM. 

The loss function I’ve used here is sparse_categorical_crossentropy, the variant of categorical cross-entropy for multi-class tasks whose labels are integers rather than one-hot vectors. It quantifies the difference between two probability distributions. The optimizer we’re using is ‘adam’, a variant of gradient descent.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
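For a single example with an integer label, this loss reduces to the negative log of the probability the model assigned to the true class. The sketch below shows that arithmetic in plain Python; the probabilities are invented for illustration and do not come from the model above.

```python
import math


def sparse_categorical_crossentropy(true_class, probs):
    """Loss for one example: negative log of the probability of the true class."""
    return -math.log(probs[true_class])


# illustrative 6-way softmax output (sums to 1)
probs = [0.05, 0.70, 0.10, 0.05, 0.05, 0.05]
confident = sparse_categorical_crossentropy(1, probs)  # true class got 0.70
wrong = sparse_categorical_crossentropy(2, probs)      # true class got 0.10
print(round(confident, 3), round(wrong, 3))
```

The loss is small when the model puts most of its probability on the correct class and grows quickly as that probability shrinks.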

ML model development organized using Neptune

To log training metrics to Neptune, we use a callback from the Neptune library. The Neptune integration for TensorFlow/Keras provides NeptuneCallback, which logs most of the metadata you would normally log in these ML libraries:

from neptune.integrations.tensorflow_keras import NeptuneCallback

neptune_clbk = NeptuneCallback(run=run, base_namespace='metrics')

epochs_count = 10
history = model.fit(train_padded, training_label_seq, epochs=epochs_count, validation_data=(valdn_padded, validation_label_seq), verbose=2, callbacks=[neptune_clbk])

Now, when you run this, all your metrics and losses will be logged in Neptune.

We can also monitor RAM and CPU usage as part of model training. The information can be found in the monitoring section of the experiments.

Establish a baseline for model performance. Start with a simple model using the initial data pipeline. Find the state-of-the-art model for the problem in your domain and reproduce its results. Later, apply that model to your own dataset as the next baseline.

Model versioning

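# save the model in HDF5 format
tfl.keras.models.save_model(model, 'classification.h5', overwrite=True, include_optimizer=True)

model.save('my_model')

# upload the saved file to the run; the field name is an assumption based on the text below
run['saved_model'].upload('classification.h5')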


The model is saved in the saved_model directory in Neptune.

Log whatever from the project

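import matplotlib.pyplot as plt

def graphs_plotting(history, string):
    plt.plot(history.history[string])           # training curve
    plt.plot(history.history['val_' + string])  # validation curve
    plt.legend([string, 'val_'+string])
    plt.show()

graphs_plotting(history, 'accuracy')
graphs_plotting(history, 'loss')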
# Define parameters as a Python dictionary and log them all at once to a namespace of a run.

PARAMS = {'epoch_num': 10,
          'batch_size': 64,
          'optimizer': 'adam',
          'loss_fun': 'sparse_categorical_crossentropy',
          'metrics': ['accuracy'],
          'activation': 'relu'}

# Pass parameters
run['parameters'] = PARAMS

# you can 

Parameters logged can be found in Neptune.

While working on an industry project, the metrics you use can change over time based on the problem you’re working on and the domain in which the model is deployed. Logging metrics can save your team a significant amount of time.

In Neptune, all your experiments are organized in a single place. You can add tags to your experiments, keep track of exactly what you tried, compare metrics, and reproduce or rerun experiments when you need to. You can sleep peacefully knowing that all your ideas are safely stored in one place.

Tags will help you track your experiments in a much better way. Before model deployment, make sure to have versioning in place for: model configuration, model parameters, training dataset, and validation dataset. Some common ways to deploy ML models are to package them in a Docker container and to expose them for inference through a REST API.

You can also compare multiple versions of notebooks you’ve logged to Neptune.

Model refinement

The above model starts overfitting after 6 epochs. You can change the epochs and retrain your model, then log your parameters to Neptune.
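One way to pick the number of epochs is to stop once the validation loss has not improved for a few epochs. The sketch below applies that rule to a list of validation losses; `best_epoch` is a hypothetical helper and the loss values are made up to resemble a run that overfits after epoch 6. Keras offers the same idea as the `EarlyStopping` callback.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, stopping
    the scan once `patience` epochs pass without improvement."""
    best, best_at, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_at


# made-up validation losses, not results from the model above
val_losses = [0.92, 0.61, 0.45, 0.38, 0.33, 0.31, 0.34, 0.39, 0.45, 0.52]
print(best_epoch(val_losses))
```

Logging the chosen epoch count together with the other parameters keeps the retrained run comparable with the earlier ones.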

As you add complexity to your model, debug it iteratively. Error analysis is necessary to find where the model fails. Track how model performance scales as the amount of training data increases. Once you know how to build a working model for your problem, try to get the best performance out of it. Split your error into: 

  • avoidable bias, 
  • variance, 
  • irreducible error,
  • difference between test error and validation error.
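The split above can be turned into simple arithmetic. The sketch below uses one common convention, which is an assumption here rather than something this article defines: avoidable bias is the gap between training error and an estimate of the irreducible error, and variance is the gap between validation and training error. All numbers are invented.

```python
def split_error(irreducible, train_err, val_err, test_err):
    """Break the observed errors into the gaps that suggest what to fix next."""
    return {
        "avoidable_bias": train_err - irreducible,
        "variance": val_err - train_err,
        "val_test_gap": test_err - val_err,
    }


# illustrative error rates, not measurements from this project
parts = split_error(irreducible=0.02, train_err=0.05, val_err=0.12, test_err=0.13)
largest = max(parts, key=parts.get)
print(parts)
print(largest)
```

With these numbers the variance term dominates, which would point towards the overfitting remedies discussed below rather than a bigger model.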

Addressing underfitting (high bias, low variance)

Perform model-specific optimization. If your model is underfitting, it hasn’t captured the underlying pattern in your data, so it performs poorly on both training and test data. It’s important to version your data and change model parameters. You can address underfitting by error analysis, increasing model capacity, tuning your hyperparameters, and adding new features.

Addressing overfitting 

When your model is overfitting, it performs well on training data and poorly on test data. It’s an indication that your model has high variance and low bias. Survey the literature about such problems, talk to experts in your team or people you know who might have dealt with similar problems. 

We can address overfitting by adding more training data, regularization, error analysis, tuning hyperparameters, and reducing model size.

Addressing distribution shift

Refining your model is very important because the model you have built may fail in some scenarios. There will be risks involved with using your approach in production. 

We can address distributional shifts by performing error analysis in order to determine the shift in distribution. Augment your data to more closely match the test distribution, and apply domain adaptation techniques.
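A quick first check for shift is to compare how often each class appears in the training data and in recent live data. The sketch below measures that with the total variation distance; the helper names and the label counts are illustrative assumptions, and a real check would usually also look at the input features.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions
    (0 means identical, 1 means no overlap)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# made-up label samples for illustration
train = ["sport"] * 50 + ["tech"] * 50
live = ["sport"] * 80 + ["tech"] * 20
shift = total_variation(class_proportions(train), class_proportions(live))
print(round(shift, 2))
```

Logging this number for every new batch of data gives you a simple trend line to watch alongside your model metrics.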

Debugging ML projects

This step is mainly done to investigate why your model is performing poorly. 

There may be some implementation bugs, dataset construction issues, or bad hyperparameters. Deploy a baseline model on production data as quickly as possible. Often, live data changes in unexpected ways. It may not reflect the data you’ve used during development (commonly known as data drift). 

Deploy a simple model quickly so that you will know what you need to do in advance. This helps quicker iterations rather than slow iterations trying to make a perfect model. A random seed needs to be fixed to ensure the model training is reproducible.
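Fixing the seed can be wrapped in a small helper. This is a minimal sketch with a hypothetical function name: it seeds Python's `random` module and NumPy if it is installed. In the TensorFlow notebook from this article you would additionally call `tfl.random.set_seed(seed)`.

```python
import random


def set_seeds(seed=42):
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch


set_seeds(42)
first = random.random()
set_seeds(42)
second = random.random()
print(first == second)
```

Logging the seed as one more run parameter means a teammate can rerun the experiment under the same conditions.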

With proper tracking of your experiments, some of the above challenges can be taken care of, and it will be easier to communicate your results to the team.


To build machine learning projects efficiently, start simple and gradually increase complexity. Often, Data Scientists and ML Engineers are presented with poorly expressed problems to develop ML solutions for. 

Spend a good amount of time understanding the scope of your project and define requirements clearly in advance, making your iterations better as you work towards your final goal. 


