Debug Your TensorFlow/Keras Model: Hands-on Guide

9 min
9th August, 2023

Debugging plays a big role in the machine learning development cycle. Generally speaking, debugging is a critical stage in all types of software development—quite often it’s also painful and time-consuming.

There’s a way to make it less painful: implement debugging strategies early and keep testing your components as you build them. This usually pays off in a more productive workflow.

Model creation begins with data exploration. Next, we select characteristics for our model and a baseline algorithm. Finally, we use various algorithms and adjust parameters to improve baseline performance…

…and this is where debugging comes into play: it ensures that our models do what they’re supposed to, and that a model meets the criteria that make it suitable for a production release.

This might seem a bit overwhelming: building and maintaining debugging frameworks and strategies can be costly and difficult. Luckily, you can use a platform that does part of that work for you by tracking your metadata, logging different research models, comparing performance, and helping you improve the features and characteristics of your models.

In this article, we’re going to discuss model debugging strategies and do a real-world implementation with Neptune and its API.

Model debugging basics

Poor model performance in machine learning can have a variety of causes, which makes debugging time-consuming. We debug our model to determine the source of the problem. Some usual causes of poor model performance include:

  • Model features lack sufficient predictive power;
  • Hyperparameters are set to suboptimal values;
  • The training dataset has flaws and anomalies that have filtered through the model;
  • Feature engineering code has bugs.

One key idea to bear in mind is that ML models keep being deployed on larger and larger tasks and datasets. The bigger the scale, the more important it is to debug your model(s). To do that, you need a plan and a set of steps to follow. Here it is:

General steps of model debugging

Below are debugging patterns that are commonly recommended by Machine Learning practitioners.

1. Check if the model predicts labels correctly

Check if your features adequately encode predictive signals. The accuracy of your model depends heavily on how much predictive signal each individual feature carries. A simple and efficient way to measure this is to evaluate the linear correlations between individual features and labels using correlation matrices.

Nonlinear correlations between features and labels, on the other hand, won’t be detected by correlation matrices. Instead, select some examples from your dataset that your model can easily learn from. Alternatively, use easily learnable synthetic data.
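To make the limitation concrete, here is a minimal sketch with synthetic data. The three features and the label formula are made up for illustration: one feature drives the label linearly, one only through a symmetric nonlinearity, and one is pure noise.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1000

# Hypothetical features: one linearly related to the label, one related
# only through a symmetric nonlinearity, one pure noise.
linear_feat = rng.normal(size=n)
nonlinear_feat = rng.uniform(-1.0, 1.0, size=n)
noise_feat = rng.normal(size=n)
label = 2.0 * linear_feat + 5.0 * nonlinear_feat ** 2 + 0.1 * rng.normal(size=n)

features = np.column_stack([linear_feat, nonlinear_feat, noise_feat])


def label_correlations(features, label):
    """Pearson correlation of each feature column with the label."""
    return np.array([np.corrcoef(features[:, j], label)[0, 1]
                     for j in range(features.shape[1])])


corr = label_correlations(features, label)
for name, c in zip(["linear", "nonlinear", "noise"], corr):
    print(f"{name:>9}: {c:+.2f}")
```

The nonlinear feature clearly influences the label, yet its linear correlation stays close to zero, just like the noise feature. That is the case where a correlation matrix alone would mislead us.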

2. Establish a model baseline

A quick test of the quality of your model is to compare it to a baseline. A model baseline is a simple model that produces reasonable results on a task and isn’t difficult to build. When creating a new model, create a baseline by predicting the label with the simplest heuristic model you can come up with. If your trained model does not outperform its baseline, you must improve it.

“Good” results can be misleading if we compare against a weak baseline | Image source: MLCMU 

Examples of a baseline model would be:

  • Using linear model versions trained only on the most predictive features of the dataset;
  • Classification models that focus on only predicting the most common label;
  • Regression models that predict the mean value.
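The last two baselines fit in a few lines of plain Python. The sketch below uses made-up toy labels and targets, and the function names are ours rather than part of any library.

```python
import numpy as np
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [most_common_label] * len(test_labels)
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return most_common_label, correct / len(test_labels)


def mean_baseline_mse(train_targets, test_targets):
    """Mean squared error obtained by always predicting the training mean."""
    mean_value = float(np.mean(train_targets))
    errors = np.asarray(test_targets, dtype=float) - mean_value
    return mean_value, float(np.mean(errors ** 2))


# Toy data, made up for illustration
train_labels = ["yes", "no", "yes", "stop", "yes", "no"]
test_labels = ["yes", "no", "yes", "stop"]
label, acc = majority_class_baseline(train_labels, test_labels)
print(label, acc)  # yes 0.5

mean_value, mse = mean_baseline_mse([1.0, 2.0, 3.0], [2.0, 4.0])
print(mean_value, mse)  # 2.0 2.0
```

Any trained model should beat these numbers by a clear margin. If it doesn't, the problem is more likely in the data or the pipeline than in the architecture.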

3. Adjust hyperparameter values 

Typically, the most targeted hyperparameters that engineers tweak first are:

  • The learning rate: ML libraries usually set a default learning rate for you. In TensorFlow, for example, most TF Estimators use the AdagradOptimizer, which sets the learning rate at 0.05 and then adaptively modifies the rate during training. If your model doesn’t converge, you can set the value manually, choosing a value between 0.0001 and 1.0.
  • The regularization penalty: If you need to reduce the size of your linear model, use L1 regularization. If you want more model stability, use L2 regularization. Increasing the stability of your model makes model training more reproducible.
  • Batch size: A mini-batch typically has a batch size of 10 to 1000, while pure stochastic gradient descent (SGD) uses a batch size of one. The upper limit on batch size is set by the amount of data that can fit in your machine’s memory; the lower limit depends on your data and algorithm.
  • Depth of network layers: The depth of a neural network refers to the number of layers, while the width refers to the number of neurons per layer. As the complexity of the corresponding problem increases, so should the depth and width. A common practice is to set the width of a layer to equal or less than the width of the previous layer. Tuning these values later helps optimize the model performance.
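To see why the learning rate is usually the first thing to tweak, here is a small sketch that sweeps it over the 0.0001 to 1.0 range mentioned above. It runs plain gradient descent on a toy quadratic loss; the loss function, curvature, and step count are arbitrary choices for illustration, not values from a real model.

```python
def final_loss(learning_rate, steps=50, curvature=3.0, w0=1.0):
    """Run gradient descent on f(w) = 0.5 * curvature * w**2 and return f(w)."""
    w = w0
    for _ in range(steps):
        gradient = curvature * w
        w -= learning_rate * gradient
    return 0.5 * curvature * w ** 2


initial_loss = 0.5 * 3.0 * 1.0 ** 2
for lr in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    loss = final_loss(lr)
    status = "diverged" if loss > initial_loss else "improved"
    print(f"lr={lr:<7} final loss={loss:.3e} ({status})")
```

On this toy problem the smallest rates barely move the loss in 50 steps, 0.1 converges quickly, and 1.0 diverges. Real models are less tidy, but the same pattern is what you look for when a training run stalls or blows up.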

Test and debug your TensorFlow/Keras model 

Let’s tackle the practical aspect of things and actually get hands-on experience implementing the points that we mentioned above. We’ll build, train and debug a TensorFlow model that performs simple audio recognition. We’re going to use Neptune because of the fully operable extension for Keras/TensorFlow models, and we’ll explore some interesting features for managing and tracking the development of our model.

Start with the speech commands dataset

This dataset includes over 105,000 WAV audio files of people saying 35 different words. Google collected this data and made it available under a CC BY license. In Google’s words:

“The dataset is designed to aid in the training and evaluation of keyword spotting systems. Its primary goal is to provide a method for developing and testing small models that detect when a single word is spoken from a set of ten target words, with as few false positives as possible due to background noise or unrelated speech.”

Link to the official research paper for the dataset: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.

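We’ll be using a smaller version of the whole dataset, and we’ll be downloading it using the tf.keras.utils.get_file API.

data_dir = pathlib.Path('Documents/SpeechRecognition/data/mini_speech_commands')
if not data_dir.exists():
  # Download and extract the mini Speech Commands archive
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
      extract=True, cache_dir='.', cache_subdir='data')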

Start your Neptune experiment:

run = neptune.init_run(project="aymane.hachcham/Speech-Recognition-TF",
                   api_token="YOUR_API_TOKEN") # your credentials

Log all relevant metadata to your Neptune dashboard:

run["config/dataset/path"] = "Documents/SpeechRecognition/data/mini_speech_commands"
run["config/dataset/size"] = 105000
run["config/dataset/total_examples"] = 8000
run["config/dataset/examples_per_label"] = 1000
run["config/dataset/list_commands"] = ["down", "go", "left", "no", "right", "stop", "up", "yes"]

Data config logs in Neptune

You can also log audio samples to your Neptune dashboard to have all metadata in one place. Currently, Neptune supports MP3, M4A, OGA, and WAVE audio formats.

run["config/audio"] = neptune.types.File("/content/data/right_commad_Sample.wav")

Here you can check an audio sample of the dataset: A Right Command Sample.

Now we need to extract all the audio files into a list and shuffle it. Then we’ll be splitting and segregating the data into training, testing, and validation sets.

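# Collect the paths of all audio files and shuffle them:
list_samples = tf.io.gfile.glob(str(data_dir) + '/*/*')
list_samples = tf.random.shuffle(list_samples)

# Split the Data:
train_samples = list_samples[:6400]
val_samples = list_samples[6400: 6400 + 800]
test_samples = list_samples[-800:]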

print("Training set size", len(train_samples))
print("Validation set size", len(val_samples))
print("Test set size", len(test_samples))

Investigating the data

Since the audio file in the data is formatted as a binary file, you’ll have to transform it into a numerical tensor. To do so, we’ll be using the TensorFlow Audio API which contains a bunch of handy functions like decode_wav that can decode WAV files into Tensors according to their sampling rate.

A sampling rate refers to the number of samples encoded per second of audio. Each sample represents the amplitude of the audio signal at a specific time. The clips in this dataset use a 16kHz sampling rate, so a one-second clip holds 16,000 samples. The range of each sample is set separately, by the bit depth: in a 16-bit system, values range from -32768 to 32767.
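The two notions are easy to mix up, so here is a small NumPy sketch that keeps them apart. The 440 Hz tone is synthetic and only there for illustration; the final scaling step mimics the kind of normalization to floats in [-1.0, 1.0] that decode_wav applies to 16-bit audio.

```python
import numpy as np

sample_rate_hz = 16_000     # samples per second
duration_s = 1.0            # clip length in seconds
num_samples = int(sample_rate_hz * duration_s)

# Range of a signed 16-bit integer, the sample format of 16-bit PCM WAV files
int16_info = np.iinfo(np.int16)
print(num_samples, int16_info.min, int16_info.max)  # 16000 -32768 32767

# A synthetic 440 Hz tone stored as 16-bit samples (illustrative only)
t = np.arange(num_samples) / sample_rate_hz
pcm = np.round(0.5 * int16_info.max * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# Scale the integers to floats in [-1.0, 1.0]
waveform = pcm.astype(np.float32) / 32768.0
print(waveform.dtype, waveform.shape)
```

The sampling rate fixes how many values the tensor holds, while the bit depth fixes the range of each value before normalization.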

Let’s decode the audio file, get the corresponding label and waveform.

# Let's decode the audio file:
def decode_audio(audio_binary):
  # Decode with the TF Audio API; the second return value is the sample rate
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)

# Get the corresponding label:
def get_label(file_path):
  # Get the label from the dataset
  parts = tf.strings.split(file_path, os.path.sep)
  return parts[-2]

# Combining function for the label and wave form:
def audio_waveform(file_path):
  label = get_label(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveform = decode_audio(audio_binary)
  return waveform, label

Once our functions are set up, we’ll use them to process the training data in order to obtain the waveforms and corresponding labels of all the samples. 

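AUTOTUNE = tf.data.AUTOTUNE  # Lets tf.data choose the number of parallel calls
files_ds = tf.data.Dataset.from_tensor_slices(train_samples)
waveform_data = files_ds.map(audio_waveform, num_parallel_calls=AUTOTUNE)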

Plotting the waveform_data for 6 audio command samples:

Waveforms and command labels from the training data

Notice that even for the same commands, the waveforms can be quite different. This has to do with the voice pitch, tone, and other related characteristics that make each voice distinctive and hard to reproduce.

Checking the voice spectrograms

Spectrograms show frequency changes over time for each waveform, and they can be represented as 2D images. This is done by converting the audio into the time-frequency domain using the short-time Fourier transform (STFT).

The STFT divides the signal into time windows and performs a Fourier transform on each window, preserving some time information and returning a 2D tensor on which standard convolutions can be performed.

Luckily, TF provides us with an stft function that perfectly handles the job: tf.signal.stft

We need to set up a few parameters before using the function. First, set the frame length and frame step parameters so that the resulting spectrogram “image” is nearly square. Also, we want all waveforms to have the same length, so that when we convert them to spectrograms, the results have the same dimensions. We do this by zero-padding clips that are shorter than one second (16,000 samples).

# Spectrogram function
def get_spectrogram(waveform_sample):
  # Padding for files with less than 16000 samples
  zero_pad = tf.zeros([16000] - tf.shape(waveform_sample), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform_sample, tf.float32)
  equal_length = tf.concat([waveform, zero_pad], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  # Returns the spectrogram as TF Tensor
  return spectrogram
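To check that these parameters give a nearly square result, the framing arithmetic can be reproduced in plain NumPy. This is a sketch, not a drop-in replacement for tf.signal.stft: it assumes an FFT length of 256 (the next power of two above the frame length) and uses NumPy's symmetric Hann window, so the values may differ slightly from TensorFlow's even though the shape logic is the same.

```python
import numpy as np


def magnitude_spectrogram(waveform, frame_length=255, frame_step=128, fft_length=256):
    """Split a 1-D signal into overlapping windowed frames and FFT each one."""
    num_frames = 1 + (len(waveform) - frame_length) // frame_step
    window = np.hanning(frame_length)
    frames = np.stack([
        waveform[i * frame_step: i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft keeps the fft_length // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))


# Half a second of a 1 kHz tone followed by zero padding up to 16000 samples
clip = np.zeros(16000, dtype=np.float32)
clip[:8000] = np.sin(2 * np.pi * 1000 * np.arange(8000) / 16000)

spec = magnitude_spectrogram(clip)
print(spec.shape)  # (124, 129)
```

With a one-second clip we get 124 time frames and 129 frequency bins, which is the nearly square “image” we were after. The zero-padded tail simply shows up as empty frames at the end.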

Plot a spectrogram and its corresponding waveform side-by-side:

# Plotting loop for one sample from the training set 
for waveform, label in waveform_data.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

  fig, axes = plt.subplots(2, figsize=(12, 8))
  timescale = np.arange(waveform.shape[0])
  axes[0].plot(timescale, waveform.numpy())
  axes[0].set_xlim([0, 16000])
  # plot_spectrogram is a plotting helper that is not shown in this article
  plot_spectrogram(spectrogram.numpy(), axes[1])

Waveform and corresponding spectrogram side-by-side

I have an article that explains in depth all the theory behind spectrograms and Mel spectrograms, and how to apply it to train a Conversational Intelligent Bot for TTS and STT tasks: Conversational AI Architectures Powered by Nvidia: Tools Guide

Baseline Classifier 

Before training the CNN, it’s worth building a simple baseline classifier to compare our more sophisticated model against. That way we can check that the CNN really earns its complexity on this task.

Two main characteristics we should bear in mind for our baseline:

  1. The baseline model should be very simple. Simple models are less prone to overfitting. If your baseline overfits, it usually indicates that you should attend to your data before moving on to a more complex classifier.
  2. Baseline models are interpretable. Baseline models help you understand your data, which gives you direction for feature engineering.

In our case we’ll be using the DummyClassifier class provided by scikit-learn. It is fairly simple and meets all the requirements of a good baseline candidate.

# Transform the data to fit the Dummy Classifier
train_audio = []
train_labels = []

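for audio, label in train_ds:
  # Convert each TF tensor to a NumPy array for scikit-learn
  train_audio.append(audio.numpy())
  train_labels.append(label.numpy())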

train_audio = np.array(train_audio)
train_labels = np.array(train_labels)

# Use our baseline model:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(train_audio, train_labels)

Then get predictions and assess the accuracy score of our dummy classifier:
dummy_pred = dummy_clf.predict(test_audio)
dummy_true = test_labels

accuracy_score(dummy_true, dummy_pred)

The accuracy score for our dummy classifier is 0.16, which is close to the chance level for eight commands (1/8 = 0.125) and very low compared to what a neural network can achieve. Once we train our model, this baseline gives us a reference point: the model should clearly surpass the capacities of such a simplistic classifier.

Build and train your model

Now that we’ve built our baseline model and all our data is ready to be used for training, we need to build our architecture. A simple CNN will do, since the model will be trained on spectrogram images. The model will learn to identify the peculiarities of each sound from its spectrogram alone.

We’ll use a batch of 64 for the data loaders.

batch_size = 64
train_samples = train_samples.batch(batch_size)
val_samples = val_samples.batch(batch_size)

The architecture

The model also has some additional processing layers, like:

  • A Resizing Layer to downsample the input and therefore train faster.
  • A Normalization Layer to apply mean and std normalization for each input image before feeding it to the model.
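What these two layers do can be sketched in NumPy. This is a simplified stand-in, not the Keras implementation: block averaging replaces the interpolation a real resizing layer performs, the statistics are computed over all pixels at once, and the random "spectrograms" are fake data for illustration.

```python
import numpy as np


def fit_normalizer(images):
    """Compute the mean and standard deviation over a set of training images."""
    return float(np.mean(images)), float(np.std(images))


def normalize(image, mean, std, eps=1e-7):
    """Shift and scale so that the training data has roughly zero mean, unit variance."""
    return (image - mean) / (std + eps)


def downsample(image, factor):
    """Average non-overlapping factor x factor blocks (a crude stand-in for resizing)."""
    h, w = image.shape
    h, w = h - h % factor, w - w % factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


rng = np.random.default_rng(0)
train_images = rng.uniform(0.0, 50.0, size=(16, 124, 128))  # fake spectrogram-like data

mean, std = fit_normalizer(train_images)
small = downsample(normalize(train_images[0], mean, std), factor=4)
print(small.shape)  # (31, 32)
```

The important detail is that the mean and standard deviation are estimated once on the training data and then reused for every input, which is the role the adapt step plays for a normalization layer.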

If we were using PyTorch, we would normally first apply a data transformation that usually includes resizing, normalizing, cropping, etc. With TensorFlow, this is managed quite easily using layers designed for this purpose.

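normalization_layer = preprocessing.Normalization()
# spectrogram_ds is an assumed name for the dataset of (spectrogram, label) pairs
normalization_layer.adapt(spectrogram_ds.map(lambda x, _: x))
# Keras Model Architecture:
sound_model = models.Sequential([
    preprocessing.Resizing(32, 32),
    normalization_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.Flatten(),  # assumed: needed to go from feature maps to Dense
    layers.Dense(128, activation='relu'),
    layers.Dense(len(commands)),  # assumed output layer: one logit per command
])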

We can also log the architecture in Neptune to save it for later runs.

# Saving the architecture to a txt file:
from contextlib import redirect_stdout

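with open(f'./{model_name}_arch.txt', 'w') as f:
    with redirect_stdout(f):
        sound_model.summary()  # the printed summary goes into the file

# Log it to Neptune (the "model/architecture" field name is an assumed example):
run["model/architecture"].upload(f'./{model_name}_arch.txt')

Model architecture saved in Neptune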

After setting up the architecture, let’s compile the model. We’ll use an Adam optimizer and a sparse categorical cross-entropy loss, and monitor accuracy over time.

hparams = {
    'batch_size': 64,
    'image_size': 120,
    'num_epochs': 10,
    'learning_rate': 0.0001,
    'beta_rate_optimizer': 0.5,
    'loss_function': tf.keras.losses.SparseCategoricalCrossentropy,
    'optimizer': tf.keras.optimizers.Adam
}

run["params"] = hparams

The best way to track the training progress of our model is the Neptune TF/Keras extension. It works as a callback and logs in real-time the values for our three sets simultaneously: Train, Test and Validation.


from neptune.integrations.tensorflow_keras import NeptuneCallback
neptune_callback = NeptuneCallback(run=run, base_namespace="Sound Recognition")

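history = sound_model.fit(train_samples, validation_data=val_samples,
                          epochs=hparams['num_epochs'], callbacks=[neptune_callback])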

Below are the results obtained for loss and accuracy for each one of the three sets.

Accuracy, Loss for the Training set
Accuracy, Loss for the Test set
Accuracy, Loss for the Validation set

One way to debug and assert the efficiency of our training is to train several more times and compare the results in terms of loss and accuracy.

Debugging the model

Tweaking model hyperparameters (number of epochs and the learning rate) shows us the model’s progression, and whether those parameters have any serious impact on performance.

Comparison for three runs

As you can see, the difference is minimal. The training curves are quite similar across runs, so changing these hyperparameters was not a turning point for model performance.

We can also display the confusion matrix to check how the model does on each of the commands of the test set. It says how accurate the model is when predicting each command, and shows if the model has a general understanding of the differences between each command.

y_pred = np.argmax(sound_model.predict(test_audio), axis=1) # Predictions
y_true = test_labels # Ground truth

# Display Confusion Matrix:
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx, xticklabels=commands, yticklabels=commands,
            annot=True, fmt='g')

Confusion matrix

You can see that our model does quite well.
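Beyond eyeballing the heatmap, we can turn the matrix into one number per command. The sketch below uses made-up labels for three classes and follows the same convention as the matrix above: rows are true labels, columns are predictions.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = matrix.sum(axis=1)
    return np.divide(np.diag(matrix), totals,
                     out=np.zeros(len(totals)), where=totals > 0)


# Made-up labels for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
print(cm)
print(recall)
```

A command with a noticeably lower recall than the others is a good place to start error analysis, for example by listening to the misclassified clips.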

Model refinement 

Iteratively debug your model as it grows in complexity. Error analysis is required to find where the model fails. Keep track of how model performance scales as the amount of data used for training increases. 

After you’ve successfully built a model for your problem, you should try to get the best performance out of the model. Track most of your potential errors by always following these basic rules:

  • Avoid any unnecessary bias
  • Remember that there will always be an irreducible error 
  • Never confuse test error with validation error

Interesting debugging strategies to implement 

Sensitivity analysis 

Sensitivity analysis is a statistical technique used to determine how sensitive a model, parameter, or other quantity is to changes in input parameters from their nominal values. This analysis demonstrates how a model reacts to unknown data and what it predicts based on given data. It’s often referred to as “What if” analysis by developers.
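A one-feature-at-a-time version of this idea can be sketched as follows. The "model" here is a hypothetical fixed linear scorer with a sigmoid output, standing in for a trained network, and the perturbation size is an arbitrary choice.

```python
import numpy as np


def toy_model(x):
    """Stand-in for a trained model: a fixed linear scorer with a sigmoid output."""
    weights = np.array([2.0, -0.5, 0.0])
    return 1.0 / (1.0 + np.exp(-(x @ weights)))


def sensitivity(model, x, delta=0.1):
    """Change in model output when each input feature is nudged by delta."""
    base = model(x)
    changes = []
    for j in range(len(x)):
        perturbed = x.copy()
        perturbed[j] += delta
        changes.append(model(perturbed) - base)
    return np.array(changes)


nominal_input = np.array([0.0, 0.0, 0.0])
effects = sensitivity(toy_model, nominal_input)
for j, effect in enumerate(effects):
    print(f"feature {j}: output change {effect:+.4f}")
```

Features with a large effect deserve the most scrutiny, while a feature with no effect at all may be one the model ignores. With a real model you would repeat this around several representative inputs rather than a single nominal point.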

See an example here: TensorFlow tutorial on Model’s Specificity and Sensitivity 

Model benchmarking

A benchmark model is simple to implement and doesn’t take much time. Use any standard algorithm to find a suitable benchmark model, and then simply compare the results to model predictions. If there are many similarities between standard algorithms and ML algorithms, a simple regression may already reveal potential problems with the algorithm.

Check an example here: A Way to Benchmark Your Deep Learning Framework On-premise 

Final thoughts 

We’ve explored debugging mechanisms and strategies that are great for experimenting with machine learning models, and we did a practical example of how you could analyse and track model performance using Neptune.

I’m leaving you with some additional resources. Don’t forget to check my other articles, and feel free to contact me for any questions you might have. 

Don’t forget to check all the code for this article in the Colab Notebook: Simple Sound Recognition

