How to Keep Track of TensorFlow/Keras Model Development with Neptune

The model development lifecycle starts with data exploration. We then choose features for our model, pick a baseline algorithm, and try to improve on the baseline with different algorithms and parameter tuning.

Sounds simple enough. But, during all of this, you’ll probably create multiple notebooks, or modify one notebook over and over again. This is a great way to lose track of your research.

Luckily, it’s avoidable. You just need to log all your models, data, and features. This way, whenever you want to revisit something you were working on before, it’s very easy to do.

In this article, I’ll show you how to organize and track TensorFlow projects and keep everything nice and tidy.

It all comes down to MLOps, which is a set of principles and tools that brings together product teams and data science teams, and all the crucial operations of development, deployment, monitoring, management, and securing ML models in production. 

MLOps is basically DevOps, but for Machine Learning. The ML lifecycle should support model delivery at speed and scale in order to handle the velocity and volume of data in your organization. Why do you need this? Because it’s very difficult to get ML applications from the idea phase to actual deployment in production.

Challenges in ML model lifecycle

To stay on track with your experiments, it’s necessary to track code, data, model versions, hyperparameters, and metrics. Organizing them in a meaningful way will help you collaborate within your organization.

In real-world projects, data changes all the time. New tables are added, mislabeled points are removed, feature engineering techniques change, validation and testing data sets change to reflect the production environment. When data changes, everything based on that data changes too, but the code remains the same so it’s important to keep track of your data versions. 

Ways to keep track of your ML experiments 

Proper experiment tracking makes it easy to compare metrics and parameters based on data versions, compare experiments, and compare best or worst predictions on test or validation sets. You can also analyze hardware consumption during model training and look at prediction explanations and feature importance from tools such as LIME.

The steps below show how to track your experiments and produce such comparison charts in Neptune.

Specify project requirements

First, set a metric for your project (the threshold for performance). For example, optimize your model for the F1-score.
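As a reminder of what that metric measures, here is a minimal pure-Python sketch of the F1-score for a binary problem. The function name and the toy labels are illustrative assumptions, not part of the project code; in practice you would likely use a library implementation.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration
score = f1_score([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```

Agreeing on a threshold for this number up front gives every later experiment a common yardstick.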

The first deployment should involve building a simple model with a focus on building a proper ML pipeline for prediction. This will help you deliver value quickly and avoid the trap of spending too much time trying to build the perfect model. When you start a new ML project at your organization, experiment runs can quickly scale to tens, hundreds, even thousands. Your workflow will get muddy if you don’t track it.

So, tracking tools like Neptune are becoming a standard tool in ML projects. You can use it to log your data, model, hyperparameters, confusion matrix, graphs, and much more. Including a tool like Neptune in your workflow/code is very simple compared to the pain you experience when you don’t track anything.

To show you how to approach tracking, we’re going to train a text classification model using TensorFlow. We’ll train the model using LSTMs:

  • Long Short Term Memory networks are a special kind of RNN, capable of handling long-term dependencies.
  • LSTMs are specially designed to take care of the long-term dependency problem (remembering information for a longer time).
  • All RNNs have repeating modules of neural nets in the form of a chain.

The figure below represents repeating modules in an LSTM.

Don’t worry about what happens on the inside (if you’re craving to learn, check this article for in-depth insights about LSTMs).

Enough introduction, let’s implement and track model development using Neptune.

Before we do any modeling or analysis, let’s set up a well-organized ML codebase.

Avoid mess in your model development process with Neptune

Install dependencies for this project

We’ll be using Neptune in Jupyter notebooks, so we need both the Neptune client and the Neptune Jupyter extension. Configuring Neptune for Jupyter notebooks lets us save notebook checkpoints to Neptune. Follow the commands below to do this.

!pip install neptune-client neptune-tensorflow-keras numpy~=1.19.2 tensorflow nltk
!pip install -U neptune-notebooks
!jupyter nbextension enable --py neptune-notebooks

After running the above commands, you’ll see the below extensions in your jupyter notebook.

Now that we’ve installed the necessary dependencies, let’s import them.

import tensorflow as tfl
import numpy as np
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
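import nltk
from nltk.corpus import stopwords
from neptune.new.integrations.tensorflow_keras import NeptuneCallback
import neptune.new as neptune

nltk.download('stopwords')  # one-time download of the NLTK stopword list
STOPWORDS = set(stopwords.words('english'))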

Next, connect your project to the Neptune client. If you’re new to the platform, read the guide to get started.

run = neptune.init(project='aravindcr/Tensorflow-Text-Classification',
                   api_token='YOUR TOKEN') # your credentials

These are some of the parameters I used for this project, and I logged them here. To follow along, use the notebook attached here in the app. To know more about logging your metadata, check this guide.

Save the hyperparameters (for each iteration)

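# log metadata from dictionary
params = {'embed_dims': 64,
          'vocab_size': 5000,
          'max_len': 200,
          'padding_type': 'post',
          'trunc_type': 'post',
          'oov_tok': '<OOV>',
          'training_portion': 0.8}
run['parameters'] = params

# the later snippets read these values as plain variables
embed_dims, vocab_size, max_len = params['embed_dims'], params['vocab_size'], params['max_len']
padding_type, trunc_type = params['padding_type'], params['trunc_type']
oov_tok, training_portion = params['oov_tok'], params['training_portion']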

Version your dataset

The dataset we’re using is BBC news article data for classification. Download the data from here. You can also log your data to Neptune with the below command. 

This will help us track different versions of the dataset when performing experiments. This can be accomplished with Neptune’s set_property function and hashlib module in Python.

# to log single file object
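The original logging snippet did not survive here, so the following is only a sketch of one way to version a dataset file with hashlib. The helper name file_md5 and the Neptune field names in the comments are assumptions for illustration. The commented lines assume the same field assignment on the run object that was used for the parameters above.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed usage with the run created earlier (field names are illustrative):
# run['dataset/train_version'] = file_md5('news-docs-bbc.csv')
# run['dataset/train'].upload('news-docs-bbc.csv')
```

If the file changes in any way, the digest changes too, so two runs with the same digest were trained on the same data.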

In the section below, I’ve created two lists, labels and texts, which store the label of each news article and the text associated with it. We’re also removing the stopwords using nltk.

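labels = []
texts = []

with open('news-docs-bbc.csv', 'r') as file:
    data = csv.reader(file, delimiter=',')
    next(data)  # skip the header row (assumes the CSV has one)
    for row in data:
        labels.append(row[0])  # assumes the label is in the first column
        text = row[1]
        for word in STOPWORDS:
            token = ' ' + word + ' '
            text = text.replace(token, ' ')
            text = text.replace('  ', ' ')  # collapse double spaces left behind
        texts.append(text)

train_size = int(len(texts) * training_portion)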

Let’s split the data into training and validation sets. If you look at the above parameters, we’re using 80% for training and 20% for validating the model we’ve built for this use case.

train_text = texts[0: train_size]
train_labels = labels[0: train_size]

validation_text = texts[train_size:]
validation_labels = labels[train_size:]

Next, let’s tokenize the sentences into words. The tokenizer keeps the five thousand most common words (vocab_size), and we use oov_token for any word it has not seen.

<OOV> will be used for words that aren’t found in word_index. fit_on_texts updates the internal vocabulary based on a list of texts, creating a vocabulary index based on word frequency.

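tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_text)  # build the vocabulary from the training texts only
word_index = tokenizer.word_index
dict(list(word_index.items())[:8])  # peek at the first few entries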


In the output, <OOV> has index 1 because the tokenizer reserves the first index for the out-of-vocabulary token. The remaining words follow in order of decreasing frequency.

Now that we’ve created a vocabulary index based on frequency, let’s convert those tokens into lists of sequences. texts_to_sequences transforms text into a sequence of integers. In simple terms, it converts words in the text to the corresponding integer value in the word_index dictionary.

train_sequences = tokenizer.texts_to_sequences(train_text)

train_padded = pad_sequences(train_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
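To make the vocabulary index and the integer conversion concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation (for instance, it skips punctuation filtering), and the function names are hypothetical.

```python
from collections import Counter


def build_word_index(texts, oov_token='<OOV>'):
    """Map words to integers: 1 for OOV, then by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}


def texts_to_ids(texts, word_index, num_words, oov_token='<OOV>'):
    """Replace each word by its index; unknown or too-rare words become OOV."""
    oov_id = word_index[oov_token]
    result = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # only indices below num_words are kept, mirroring the vocab_size cut-off
            ids.append(idx if idx is not None and idx < num_words else oov_id)
        result.append(ids)
    return result


index = build_word_index(['the cat sat', 'the dog sat', 'the bird'])
sequences = texts_to_ids(['the cat flew'], index, num_words=4)
```

With num_words=4, only the OOV token and the two most frequent words keep their own index, so both the rarer word and the unseen word collapse to 1.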

One thing to keep in mind when training neural nets on a downstream NLP task is that all sequences need to be the same length, so we bring them to max_len using pad_sequences. In our case, I specified 200 for max_len at the beginning.

Articles shorter than max_len are padded with zeros up to 200, and longer ones are truncated to 200. For example, if the sequence length is 186, it will be padded to 200 with 14 zeros. We fit the tokenizer once, on the training set only, and then reuse it to convert both sets, which is why we haven’t combined the training and validation sets.

valdn_sequences = tokenizer.texts_to_sequences(validation_text)
valdn_padded = pad_sequences(valdn_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

Let’s take a look at our labels. Labels need to be tokenized and all training labels are expected to be in the form of a NumPy array. We’ll convert those to a NumPy array with the below code.

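label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

# same 80/20 split as for the texts
training_label_seq = np.array(label_tokenizer.texts_to_sequences(labels[:train_size]))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(labels[train_size:]))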

Before starting with the modeling task, let’s compare an article before and after tokenization and padding. Some of the words become <OOV> because they fall outside the vocab_size set at the top.

word_index_reverse = dict([(value, key) for (key, value) in word_index.items()])

def decode_article(text):
    return ' '.join([word_index_reverse.get(i, '?') for i in text])

Train TensorFlow model

With tfl.keras.Sequential, we group a linear stack of layers into a tfl.keras.Model. The first layer is an embedding layer, which stores one vector per word, so sequences of words are converted into sequences of vectors. The goal of word embeddings is that words with similar meanings end up with similar vector representations.

tfl.keras.layers.Bidirectional is the bidirectional wrapper for RNNs. It propagates the input forwards and backwards through the LSTM layer and then combines the two outputs, which helps with learning long-term dependencies. For classification, we then feed its output into dense layers.

The activation functions we’re using here are relu and softmax. The relu function returns 0 for any negative input, and for any positive value of x it returns x unchanged. To know a little bit more about relu, check the guide.

The output dense layer has six units: the label tokenizer numbers the five news categories starting from 1, so index 0 goes unused. Its softmax activation normalizes the output of the network to a probability distribution over the predicted output classes.
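Both activations are small enough to write out by hand. The following pure-Python sketch is for intuition only; the model itself uses the Keras implementations.

```python
import math


def relu(x):
    """Return 0 for negative inputs and x itself otherwise."""
    return max(0.0, x)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

The largest score always receives the largest probability, which is why the predicted class is simply the index of the maximum output.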

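model = tfl.keras.Sequential([
    tfl.keras.layers.Embedding(vocab_size, embed_dims),
    tfl.keras.layers.Bidirectional(tfl.keras.layers.LSTM(embed_dims)),
    tfl.keras.layers.Dense(embed_dims, activation='relu'),
    tfl.keras.layers.Dense(6, activation='softmax')
])
model.summary()

As you can see in the model summary, we have an embedding layer followed by a bidirectional LSTM. The output of the bidirectional layer is twice the LSTM size (128 for 64 units) because the forward and backward outputs are concatenated.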

The loss function I’ve used here is sparse_categorical_crossentropy, the variant of categorical cross-entropy used in multi-class classification tasks with integer labels. It quantifies the difference between two probability distributions. The optimizer we’re using is ‘adam’, a variant of gradient descent.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
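For a single example, the sparse_categorical_crossentropy loss passed to model.compile reduces to the negative log of the probability the model assigns to the true class. A minimal sketch, assuming an integer label and an already-normalized probability list (the clipping constant is an arbitrary choice here):

```python
import math


def sparse_categorical_crossentropy(label, probs, eps=1e-7):
    """Loss for one example: -log(probability assigned to the true class)."""
    p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(p)


confident_right = sparse_categorical_crossentropy(2, [0.1, 0.2, 0.7])
confident_wrong = sparse_categorical_crossentropy(0, [0.1, 0.2, 0.7])
```

The loss grows quickly as the probability of the true class shrinks, so confident mistakes are penalized far more than hesitant ones.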

ML model development organized using Neptune

To log training metrics to Neptune, we use a callback from the Neptune library. The NeptuneCallback from the TensorFlow/Keras integration, shown below, logs most of the metadata you would normally log with these ML libraries:

from neptune.new.integrations.tensorflow_keras import NeptuneCallback

neptune_clbk = NeptuneCallback(run=run, base_namespace='metrics')

epochs_count = 10
history = model.fit(train_padded, training_label_seq, epochs=epochs_count, validation_data=(valdn_padded, validation_label_seq), verbose=2, callbacks=[neptune_clbk])

Now, when you run this, all your metrics and losses will be logged in Neptune.

We can also monitor RAM and CPU usage as part of model training. The information can be found in the monitoring section of the experiments.

Establish a baseline for model performance. Start with a simple model using the initial data pipeline. Find the state-of-the-art model for the problem in your domain and reproduce its results. Later, apply it to your own datasets as the next baseline.

Model versioning

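tfl.keras.models.save_model(model, 'classification.h5', overwrite=True, include_optimizer=True, save_format=None,
        signatures=None, options=None, save_traces=True)
model.save('my_model')
run['saved_model'].upload('classification.h5')  # upload to Neptune; the field name here is an assumption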


The model is saved in the saved_model directory in Neptune.

Log anything else from the project

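import matplotlib.pyplot as plt

def graphs_plotting(history, string):
    plt.plot(history.history[string])           # training curve
    plt.plot(history.history['val_' + string])  # validation curve
    plt.legend([string, 'val_' + string])
    plt.show()
graphs_plotting(history, 'accuracy')
graphs_plotting(history, 'loss')
# Define parameters as a Python dictionary and log them all at once to a Namespace of a Run.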

PARAMS = {'epoch_num': 10,
          'batch_size': 64,
          'optimizer': 'adam',
          'loss_fun': 'sparse_categorical_crossentropy',
          'metrics': ['accuracy'],
          'activation': 'relu'}

# Pass parameters
run['parameters'] = PARAMS

# you can 

Parameters logged can be found in Neptune.

While working on an industry project, the metrics you use can change over time based on the problem you’re working on and the domain in which the model is deployed. Logging metrics can save your team a significant amount of time.

In Neptune, all your experiments are organized in a single place. You can add tags to your experiments, keep track of exactly what you tried, compare metrics, and reproduce or rerun experiments when you need to. You can sleep peacefully knowing that all your ideas are safely stored in one place.

Tags also make your experiments much easier to filter and find later. Before model deployment, make sure to have versioning in place for the model configuration, model parameters, training dataset, and validation dataset. Common ways to deploy ML models are to package them in a Docker container and to expose them for inference through a REST API.

You can also compare multiple versions of notebooks you’ve logged to Neptune.

Model refinement

The above model starts overfitting after 6 epochs. You can change the epochs and retrain your model, then log your parameters to Neptune.
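One simple way to decide how many epochs to keep is to look for the epoch with the lowest validation loss. The sketch below uses made-up loss values purely for illustration; with a real run you would pass history.history['val_loss'] instead. The helper name is hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Made-up numbers for illustration only, not results from this project
example_val_loss = [1.20, 0.80, 0.55, 0.41, 0.36, 0.33, 0.35, 0.39, 0.44, 0.50]
print(best_epoch(example_val_loss))  # prints 6
```

Keras can also automate this during training with the tf.keras.callbacks.EarlyStopping callback, which stops when the monitored metric stops improving.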

As you add complexity to your model, debug it iteratively. Error analysis is necessary to find where the model fails. Track how model performance scales as the amount of training data increases. Once you know how to build a working model for your problem, try to get the best performance out of it. Split your error into:

  • avoidable bias, 
  • variance, 
  • irreducible error,
  • difference between test error and validation error.
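The split above can be sketched as simple differences between error rates. This follows a common rule-of-thumb decomposition and assumes you have an estimate of the irreducible error (for example, human-level error). The function name and the numbers are hypothetical.

```python
def decompose_error(irreducible, train, validation, test):
    """Split observed error rates into the four components listed above."""
    return {
        'irreducible_error': irreducible,
        'avoidable_bias': train - irreducible,   # how far training error is from the floor
        'variance': validation - train,          # gap between training and validation
        'val_test_gap': test - validation,       # sign of overfitting to the validation set
    }


# Hypothetical error rates for illustration
parts = decompose_error(irreducible=0.02, train=0.05, validation=0.12, test=0.13)
```

In this made-up case the variance term dominates, which would point towards the overfitting remedies rather than a bigger model.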

Addressing underfitting (high bias, low variance)

Perform model-specific optimization. If your model is underfitting, it has failed to capture the underlying pattern in your data, so it performs poorly on your training data as well as your test data. It’s important to version your data and change model parameters. You can address underfitting by error analysis, increasing model capacity, tuning your hyperparameters, and adding new features.

Addressing overfitting 

When your model is overfitting, it performs well on training data and poorly on test data. It’s an indication that your model has high variance and low bias. Survey the literature about such problems, talk to experts in your team or people you know who might have dealt with similar problems. 

We can address overfitting by adding more training data, regularization, error analysis, tuning hyperparameters, and reducing model size.

Addressing distribution shift

Refining your model is very important because the model you have built may fail in some scenarios. There will be risks involved with using your approach in production. 

We can address distributional shifts by performing error analysis in order to determine the shift in distribution. Augment your data to more closely match the test distribution, and apply domain adaptation techniques.

Debugging ML projects

This step is mainly done to investigate why your model is performing poorly. 

There may be implementation bugs, dataset construction issues, or bad hyperparameters. Deploy a baseline model on production data as quickly as possible. Live data often changes in unexpected ways and may no longer reflect the data you used during development (commonly known as data drift).

Deploying a simple model quickly shows you early what needs work. It enables quick iterations instead of slow ones spent trying to make a perfect model. Fix a random seed to make model training reproducible.
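A minimal sketch of seed fixing is shown below. Only Python's built-in generator is seeded in the runnable part. The commented lines are the NumPy and TensorFlow counterparts you would enable in this project, and the helper name set_seed is an assumption.

```python
import random


def set_seed(seed=42):
    """Seed Python's random module; uncomment the rest when NumPy/TensorFlow are in use."""
    random.seed(seed)
    # np.random.seed(seed)        # NumPy generator
    # tfl.random.set_seed(seed)   # TensorFlow global seed


set_seed(1)
first = [random.random() for _ in range(3)]
set_seed(1)
second = [random.random() for _ in range(3)]
print(first == second)  # prints True
```

Logging the seed alongside the other parameters makes it possible to rerun an experiment later. Note that some GPU operations can still introduce small run-to-run differences.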

With proper tracking of your experiments, some of the above challenges can be taken care of, and it will be easier to communicate your results to the team.


To build machine learning projects efficiently, start simple and gradually increase complexity. Often, Data Scientists and ML Engineers are presented with poorly expressed problems to develop ML solutions for. 

Spend a good amount of time understanding the scope of your project and define requirements clearly in advance, making your iterations better as you work towards your final goal. 


