Debug Your TensorFlow/Keras Model: Hands-on Guide
Debugging plays a big role in the machine learning development cycle. Generally speaking, debugging is a critical stage in all types of software development, and quite often it's also painful and time-consuming.
There's a way to make it less painful: implement debugging strategies early and keep testing your components, and you'll likely end up with a high-productivity environment.
Model creation begins with data exploration. Next, we select features for our model and a baseline algorithm. Finally, we use various algorithms and adjust parameters to improve baseline performance…
…and this is where debugging comes into play: it ensures that our baseline models do what they're supposed to, i.e. that the baseline model meets the criteria that make it suitable for production release.
This might seem a bit overwhelming: building and maintaining debugging frameworks and strategies can be costly and difficult. Luckily, you can use a platform that will do that work for you: tracking your metadata, logging different research models, comparing performance, and improving the features and characteristics of your models.
In this article, we're going to discuss model debugging strategies and do a real-world implementation with neptune.ai and its API.
Model debugging basics
Poor model performance in machine learning can be caused by a variety of factors, which makes debugging time-consuming. Models perform poorly when their features lack predictive power or their hyperparameters sit at suboptimal values, so we debug the model to determine the source of the problem. Some usual causes of poor model performance include:
- Model features lack sufficient predictive power;
- Hyperparameters are set to suboptimal values;
- The training dataset has flaws and anomalies that have filtered through the model;
- Feature engineering code has bugs.
One key idea to bear in mind is that ML models keep being deployed on larger and larger tasks and datasets. The bigger the scale, the more important it is to debug your model(s). To do that, you need a plan and a set of steps to follow. Here it is:
General steps of model debugging
Below are the usual debugging patterns that are common among experienced machine learning practitioners.
1. Check if the model predicts labels correctly
Check if your features adequately encode predictive signals. The accuracy of your model has a lot to do with how well individual features encode a predictive signal. A simple and efficient way to measure this is to evaluate the linear correlations between individual features and labels using correlation matrices.
Nonlinear correlations between features and labels, on the other hand, won't be detected by correlation matrices. In that case, pick some examples from your dataset that your model should be able to learn from easily, or use easily learnable synthetic data.
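Here is a rough sketch of that first check; the column names and data below are purely illustrative, but the same pandas pattern works on any tabular dataset:

import numpy as np
import pandas as pd

# Hypothetical tabular dataset: a couple of numeric features plus the label column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
})
df["label"] = 2.0 * df["feature_a"] + rng.normal(scale=0.1, size=500)

# Pearson correlation of every feature with the label
corr_with_label = df.corr()["label"].drop("label")
print(corr_with_label.sort_values(ascending=False))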
2. Establish a model baseline
A quick test of the quality of your model is to compare it to a baseline. A model baseline is a simple model that produces reasonable results on a task and isn't difficult to build. When creating a new model, create a baseline by predicting the label with the simplest heuristic you can come up with. If your trained model can't outperform this baseline, you need to fix or improve it.

Examples of a baseline model (a quick scikit-learn sketch follows this list) would be:
- Using linear model versions trained only on the most predictive features of the dataset;
- Classification models that focus on only predicting the most common label;
- Regression models that predict the mean value.
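For the last two heuristics, scikit-learn's dummy estimators are a quick way to get such a baseline; a minimal sketch on toy data:

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.random.rand(100, 4)              # toy features, purely for illustration
y_class = np.random.randint(0, 3, 100)  # toy class labels
y_reg = np.random.rand(100)             # toy regression targets

# Classification baseline: always predict the most common label
clf_baseline = DummyClassifier(strategy="most_frequent").fit(X, y_class)

# Regression baseline: always predict the mean of the training targets
reg_baseline = DummyRegressor(strategy="mean").fit(X, y_reg)

print(clf_baseline.score(X, y_class), reg_baseline.score(X, y_reg))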
3. Adjust hyperparameter values
Typically, the hyperparameters that engineers target and tweak first are the following (a short Keras sketch follows the list):
- The learning rate: ML libraries often set the learning rate automatically. In TensorFlow, for example, most TF Estimators use the AdagradOptimizer, which sets the learning rate to 0.05 and then adaptively modifies it during training. Alternatively, if your model doesn't converge, you can set the value manually and choose something between 0.0001 and 1.0.
- The regularization penalty: If you need to reduce the size of your linear model, use L1 regularization. If you want more model stability, use L2 regularization. Increasing the stability of your model makes model training more reproducible.
- Batch size: A mini-batch typically has a batch size of 10 to 1000, while the batch size for pure SGD is 1. The upper limit is determined by the amount of data that fits in your machine's memory, and the right value depends on your data and algorithm.
- Depth of network layers: The depth of a neural network refers to the number of layers, while the width refers to the number of neurons per layer. As the complexity of the corresponding problem increases, so should the depth and width. A common practice is to set the width of a layer to equal or less than the width of the previous layer. Tuning these values later helps optimize the model performance.
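A short Keras sketch of where each of these knobs lives; the layer sizes and values below are illustrative, not tuned:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # regularization penalty
    tf.keras.layers.Dense(32, activation="relu"),  # width equal to or smaller than the previous layer
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate set manually
    loss="mse",
)

# The batch size is passed at fit time; mini-batches of 10 to 1000 are typical:
# model.fit(x_train, y_train, batch_size=64, epochs=10)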
Test and debug your TensorFlow/Keras model
Let's tackle the practical side of things and get some hands-on experience implementing the points mentioned above. We'll build, train, and debug a TensorFlow model that performs simple audio recognition. We're going to use Neptune because of its fully operable integration with Keras/TensorFlow models, and we'll explore some interesting features for managing and tracking the development of our model.
Start with the speech commands dataset
This dataset includes over 105,000 WAV audio files of people saying thirty different words. Google collected this information and made it available under a CC BY license. In Google's words:
"The dataset is designed to aid in the training and evaluation of keyword spotting systems. Its primary goal is to provide a method for developing and testing small models that detect when a single word is spoken from a set of ten target words, with as few false positives as possible due to background noise or unrelated speech".
Link to the official research paper for the dataset: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.
We'll be using a smaller version of the whole dataset, downloading it with tf.keras.utils.get_file and processing it with the tf.data API.
import pathlib
import tensorflow as tf

# Download the mini Speech Commands dataset if it isn't cached locally yet
data_dir = pathlib.Path('Documents/SpeechRecognition/data/mini_speech_commands')
if not data_dir.exists():
    tf.keras.utils.get_file(
        'mini_speech_commands.zip',
        origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
        extract=True,
        cache_dir='.', cache_subdir='data')
Start your Neptune experiment:
import neptune

run = neptune.init_run(
    project="aymane.hachcham/Speech-Recognition-TF",
    api_token="YOUR_API_TOKEN",  # your credentials
)
Log all relevant metadata to your Neptune dashboard:
run["config/dataset/path"] = "Documents/SpeechRecognition/data/mini_speech_commands"
run["config/dataset/size"] = 105000
run["config/dataset/total_examples"] = 8000
run["config/dataset/examples_per_label"] = 1000
run["config/dataset/list_commands"] = ["down", "go", "left", "no", "right", "stop", "up", "yes"]

You can also log audio samples to your Neptune dashboard to keep all metadata in one place. Currently, Neptune supports MP3, M4A, OGA, and WAVE audio formats.
run["config/audio"] = neptune.types.File("/content/data/right_commad_Sample.wav")
Here you can check an audio sample of the dataset: A Right Command Sample.
Now we need to extract all the audio files into a list and shuffle it. Then we'll split the data into training, validation, and test sets.
# Gather all WAV file paths and shuffle them:
list_samples = tf.io.gfile.glob(str(data_dir) + '/*/*.wav')
list_samples = tf.random.shuffle(list_samples)

# Split the data:
train_samples = list_samples[:6400]
val_samples = list_samples[6400: 6400 + 800]
test_samples = list_samples[-800:]

print("Training set size", len(train_samples))
print("Validation set size", len(val_samples))
print("Test set size", len(test_samples))
Investigating the data
Since the audio files in the dataset are stored as binary files, you'll have to transform them into numerical tensors. To do so, we'll use the TensorFlow audio API, which contains handy functions like decode_wav that decode WAV files into tensors along with their sampling rate.
A sampling rate refers to the number of samples encoded per second in an audio file, and each sample represents the amplitude of the audio signal at a specific point in time. The files in this dataset are sampled at 16 kHz and stored as 16-bit values, so the raw samples range from -32768 to 32767.
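As a quick sanity check (a small sketch reusing the shuffled file list from above), you can decode one file and confirm its sampling rate; note that decode_wav rescales the 16-bit samples to the [-1.0, 1.0] range:

# Decode one WAV file and inspect its sampling rate and value range
audio_binary = tf.io.read_file(train_samples[0])
audio, sample_rate = tf.audio.decode_wav(audio_binary)

print(sample_rate.numpy())                       # expected: 16000
print(audio.numpy().min(), audio.numpy().max())  # decode_wav outputs floats in [-1.0, 1.0]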
Let's decode the audio file, get the corresponding label and waveform.
import os

# Let's decode the audio file:
def decode_audio(audio_binary):
    # Decode the WAV file and drop the channel dimension
    audio, _ = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1)

# Get the corresponding label:
def get_label(file_path):
    # The label is the name of the file's parent directory
    parts = tf.strings.split(file_path, os.path.sep)
    return parts[-2]

# Combining function for the label and waveform:
def audio_waveform(file_path):
    label = get_label(file_path)
    audio_binary = tf.io.read_file(file_path)
    waveform = decode_audio(audio_binary)
    return waveform, label
Once our functions are set up, we'll use them to process the training data in order to obtain the waveforms and corresponding labels of all the samples.
AUTOTUNE = tf.data.AUTOTUNE  # let tf.data tune the number of parallel calls automatically
files_ds = tf.data.Dataset.from_tensor_slices(train_samples)
waveform_data = files_ds.map(audio_waveform, num_parallel_calls=AUTOTUNE)
Plotting the waveform_data for 6 audio command samples:
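A plotting loop along these lines (a rough matplotlib sketch following the layout of the TensorFlow audio tutorial) produces such a grid:

import matplotlib.pyplot as plt

# Plot the first 6 (waveform, label) pairs on a 3x2 grid
rows, cols = 3, 2
fig, axes = plt.subplots(rows, cols, figsize=(10, 12))
for i, (audio, label) in enumerate(waveform_data.take(rows * cols)):
    ax = axes[i // cols][i % cols]
    ax.plot(audio.numpy())
    ax.set_title(label.numpy().decode('utf-8'))
plt.show()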

You can notice that even for the same command, the waveforms can be quite different. This has to do with pitch, tone, and other characteristics that make each voice unique and hard to reproduce.
Checking the voice spectrograms
Spectrograms show frequency changes over time for each waveform, and they can be represented as 2D images. This is done by converting the audio into the time-frequency domain using the short-time Fourier transform (STFT).
The STFT divides the signal into time windows and performs a Fourier transform on each window, preserving some time information and returning a 2D tensor on which standard convolutions can be performed.
Luckily, TF provides us with an stft function that perfectly handles the job: tf.signal.stft
We need to set a few parameters before using the function. First, choose the frame length and frame step so that the resulting spectrogram "image" is nearly square. We also want all waveforms padded to the same length, so that converting them to spectrograms yields results with identical dimensions.
# Spectrogram function
def get_spectrogram(waveform_sample):
    # Padding for files with less than 16000 samples
    zero_pad = tf.zeros([16000] - tf.shape(waveform_sample), dtype=tf.float32)
    # Concatenate audio with padding so that all audio clips have the same length
    waveform = tf.cast(waveform_sample, tf.float32)
    equal_length = tf.concat([waveform, zero_pad], 0)
    spectrogram = tf.signal.stft(
        equal_length, frame_length=255, frame_step=128)
    spectrogram = tf.abs(spectrogram)
    # Return the spectrogram as a TF tensor
    return spectrogram
Plot a spectrogram and its corresponding waveform side-by-side:
import numpy as np
import matplotlib.pyplot as plt

# Plotting loop for one sample from the training set
for waveform, label in waveform_data.take(1):
    label = label.numpy().decode('utf-8')
    spectrogram = get_spectrogram(waveform)

    fig, axes = plt.subplots(2, figsize=(12, 8))
    timescale = np.arange(waveform.shape[0])
    axes[0].plot(timescale, waveform.numpy())
    axes[0].set_title('Waveform')
    axes[0].set_xlim([0, 16000])
    # plot_spectrogram is a helper that renders the log-scaled spectrogram (see the TF audio tutorial)
    plot_spectrogram(spectrogram.numpy(), axes[1])
    axes[1].set_title('Spectrogram')
    plt.show()

I have an article that explains in-depth all the theory behind spectrograms and Mel spectrograms, and how to apply it to train a conversational intelligent bot for TTS and STT tasks: Conversational AI Architectures Powered by Nvidia: Tools Guide
Baseline Classifier
Before training the CNN, it's worth testing the accuracy and performance of our sophisticated model against a simple baseline classifier. That way, we can be confident that our CNN really adds value and matches the complexity of the task.
Two main characteristics we should bear in mind for our baseline:
- The baseline model should be very simple. Simple models are less prone to overfitting. If even your baseline overfits, it usually indicates that you should look more closely at your data before moving to a more complex classifier.
- Baseline models are interpretable. They help you understand your data and give you direction for feature engineering.
In our case we'll be using the DummyClassifier module provided by scikit-learn. It is fairly simple and meets all the requirements to make it a perfect candidate.
# Transform the data to fit the Dummy Classifier
# (train_ds is the dataset of (spectrogram, label_id) pairs built from the waveforms above)
train_audio = []
train_labels = []

for audio, label in train_ds:
    train_audio.append(audio.numpy())
    train_labels.append(label.numpy())

train_audio = np.array(train_audio)
train_labels = np.array(train_labels)

# Use our baseline model:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(train_audio, train_labels)
Then get predictions and assess the accuracy score of our dummy classifier:
# test_audio and test_labels are built from test_samples the same way as the training arrays above
dummy_pred = dummy_clf.predict(test_audio)
dummy_true = test_labels
accuracy_score(dummy_true, dummy_pred)
The accuracy score for our dummy classifier is 0.16, which is very low compared to what a neural network can achieve. Once we train our model, we'll see that this baseline result clearly shows how well our model performs and how far it surpasses a simplistic ML classifier.
Build and train your model
Now that we've built our baseline model and all our data is ready for training, we need to define our architecture. A simple CNN will do, since the model will be trained on spectrogram images; it will learn to identify the peculiarities of each sound purely from its spectrogram.
We'll use a batch size of 64 for the data loaders.
batch_size = 64
# Batch the spectrogram datasets that will be fed to the model
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)
The architecture
The model also has some additional processing layers:
- A Resizing Layer to downsample the input and therefore train faster.
- A Normalization Layer to apply mean and std normalization for each input image before feeding it to the model.
If we were using PyTorch, we would normally first apply the data transformations, which usually include resizing, normalizing, cropping, and so on. With TensorFlow, this is managed quite easily using modules designed for this purpose.
from tensorflow.keras import layers, models
from tensorflow.keras.layers.experimental import preprocessing

# Normalization layer, adapted on the spectrogram dataset
norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

# Keras Model Architecture:
# input_shape is the shape of one spectrogram; num_labels is the number of commands (8 here)
sound_model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])
We can also log the architecture in Neptune to save it for later runs.
# Saving the architecture to a txt file:
from contextlib import redirect_stdout

model_name = "sound_model"  # any identifier you want to use for this run

with open(f'./{model_name}_arch.txt', 'w') as f:
    with redirect_stdout(f):
        sound_model.summary()

# Log it to Neptune:
run[f"io_files/artifacts/{model_name}_arch"].upload(f"./{model_name}_arch.txt")

After setting up the architecture, let's compile the model. We'll use an Adam optimizer and a sparse categorical cross-entropy loss to evaluate model accuracy over time.
hparams = {
    'batch_size': 64,
    'image_size': 120,
    'num_epochs': 10,
    'learning_rate': 0.0001,
    'beta_rate_optimizer': 0.5,
    'loss_function': tf.keras.losses.SparseCategoricalCrossentropy,
    'optimizer': tf.keras.optimizers.Adam
}
run["params"] = hparams
The best way to track the training progress of our model is the Neptune TF/Keras integration. It works as a Keras callback and logs the loss and accuracy values for the training and validation sets in real time.
sound_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=hparams["learning_rate"]),  # use the logged learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

from neptune.integrations.tensorflow_keras import NeptuneCallback

neptune_callback = NeptuneCallback(run=run, base_namespace="Sound Recognition")

history = sound_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=hparams["num_epochs"],
    callbacks=[neptune_callback],
)
Below are the results obtained for loss and accuracy on the training and validation sets.



One way to debug and assert the efficiency of our training is to train several more times and compare the results in terms of loss and accuracy.
Debugging the model
Tweaking model hyperparameters (the number of epochs and the learning rate) shows us the model's progression, and whether those parameters have any serious impact on performance.
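A rough sketch of how such a comparison can be run: clone the architecture and log each attempt as its own Neptune run, so the loss and accuracy curves can be compared side by side (note that clone_model returns a fresh, untrained copy and does not carry over the adapted normalization statistics):

# Re-train the same architecture with different learning rates,
# logging each attempt as a separate Neptune run for comparison
for lr in [1e-3, 1e-4]:
    trial_run = neptune.init_run(project="aymane.hachcham/Speech-Recognition-TF",
                                 api_token="YOUR_API_TOKEN")
    trial_run["params/learning_rate"] = lr

    trial_model = tf.keras.models.clone_model(sound_model)
    trial_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    trial_model.fit(train_ds, validation_data=val_ds,
                    epochs=hparams["num_epochs"],
                    callbacks=[NeptuneCallback(run=trial_run)])
    trial_run.stop()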

As you can see, the difference is minimal. The overall training runs are quite similar, and the change in hyperparameters is not a turning point for either model improvement or model performance.
We can also display the confusion matrix to check how the model does on each of the commands of the test set. It says how accurate the model is when predicting each command, and shows if the model has a general understanding of the differences between each command.
import seaborn as sns

# 'commands' is the list of command labels logged earlier (down, go, left, no, right, stop, up, yes)
y_pred = np.argmax(sound_model.predict(test_audio), axis=1)  # Predictions
y_true = test_labels  # Ground truth

# Display Confusion Matrix:
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx, xticklabels=commands, yticklabels=commands,
            annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()

You can see that our model does quite well.
Model refinement
Iteratively debug your model as it grows in complexity. Error analysis is required to find where the model fails. Keep track of how model performance scales as the amount of data used for training increases.
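One minimal way to track that scaling, reusing the batched datasets from the training section (epochs kept low for speed, values purely illustrative), is to re-train on growing fractions of the data and record the validation accuracy:

# Learning-curve check: how does validation accuracy change with training-set size?
for fraction in [0.25, 0.5, 1.0]:
    subset = train_ds.take(int(len(train_ds) * fraction))
    trial = tf.keras.models.clone_model(sound_model)  # fresh, untrained copy of the architecture
    trial.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    history = trial.fit(subset, validation_data=val_ds, epochs=5, verbose=0)
    print(fraction, history.history['val_accuracy'][-1])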
After you’ve successfully built a model for your problem, you should try to get the best performance out of the model. Track most of your potential errors by always following these basic rules:
- Avoid any unnecessary bias
- Remember that there will always be an irreducible error
- Never confuse test error with validation error (see the sketch below)
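For the last point, it helps to keep the two evaluations explicit in code; a minimal sketch, reusing val_ds and the test_audio/test_labels arrays from earlier:

# Validation error: used while iterating on hyperparameters and architecture
val_loss, val_acc = sound_model.evaluate(val_ds, verbose=0)

# Test error: computed once, on held-out data that never influenced model choices
test_loss, test_acc = sound_model.evaluate(test_audio, test_labels, verbose=0)

print(f"validation accuracy: {val_acc:.3f}, test accuracy: {test_acc:.3f}")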
Interesting debugging strategies to implement
Sensitivity analysis
Sensitivity analysis is a statistical technique used to determine how sensitive a model, parameter, or other quantity is to changes in input parameters from their nominal values. This analysis demonstrates how a model reacts to unknown data and what it predicts based on given data. It's often referred to as "What if" analysis by developers.
See an example here: TensorFlow tutorial on Model's Specificity and Sensitivity
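As a minimal sketch of the idea, reusing the test_audio spectrograms from earlier, you can perturb an input slightly and measure how much the model's output moves:

# "What if" check: nudge one input and see how much the predictions change
sample = test_audio[:1].astype(np.float32)
baseline_scores = sound_model.predict(sample)

perturbed_scores = sound_model.predict(sample * 1.05)  # 5% perturbation of the input

print("Max change in predicted logits:", np.abs(perturbed_scores - baseline_scores).max())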
Model benchmarking
A benchmark model is simple to implement and doesn't take much time to build. Use any standard algorithm to find a suitable benchmark model, and then compare its results to your model's predictions. If the two behave very similarly, even a simple regression may already reveal potential problems with your algorithm.
Check an example here: A Way to Benchmark Your Deep Learning Framework On-premise
Final thoughts
We've explored debugging mechanisms and strategies that are great for experimenting with machine learning models, and we walked through a practical example of how you can analyse and track model performance using Neptune.
I'm leaving you with some additional resources. Don't forget to check my other articles, and feel free to contact me with any questions you might have.
Don't forget to check all the code for this article in the Colab Notebook: Simple Sound Recognition
Resources:
- Testing and Debugging in Machine Learning, Google blog.
- How to Keep Track of TensorFlow/Keras Model Development with Neptune
- The Ultimate Guide to Debugging your Machine Learning models
- Model Debugging Strategies â Machine Learning Guide