Neptune Blog

Keras Tuner: Lessons Learned From Tuning Hyperparameters of a Real-Life Deep Learning Model

Anton Morgunov

23rd April, 2025

The performance of your machine learning model depends on your configuration. Finding an optimal configuration, both for the model and for the training algorithm, is a big challenge for every machine learning engineer.

Model configuration can be defined as a set of hyperparameters that influence the model architecture. In the case of deep learning, these can be things like the number of layers or the types of activation functions. Training algorithm configuration, on the other hand, influences the speed and quality of the training process. The learning rate is a good example of a parameter in a training configuration.

To select the right set of hyperparameters, we do hyperparameter tuning. Even though tuning might be time- and CPU-consuming, the end result pays off, unlocking the highest potential capacity for your model.

If, like me, you’re a deep learning engineer working with TensorFlow/Keras, then you should consider using Keras Tuner. It’s a great tool that helps with hyperparameter tuning in a smart and convenient way.

In this article, I’ll tell you how I like to implement Keras Tuner in deep learning projects.

Why real projects matter

Everything that I’ll be doing is based on a real project. It’s not a toy problem, which is important to mention because you’ve probably seen other articles that aren’t based on real projects. Well, not this one!

Why is it so important to work with a project that reflects real life? It’s simple: these projects are much more complex at the core. Tools that work well on a small synthetic problem can perform poorly on real-life challenges. So, today I’ll show you what real value you can expect from Keras Tuner, and how to implement it in your own deep learning project.

Project description & problem statement

We’ll be doing an image segmentation task. We’ll try to segment multiple objects of interest on an image of a paper scan. Our end-goal is to extract particular pieces of text using segmentation. The below paper scan is an example of what we’re going to work with:

*Example of an input image in which we need to segment the text objects of interest*

We can approach this problem using recent advances in computer vision. For example, U-NET, first introduced by Ronneberger, Fischer and Brox in 2015 to segment medical images, is a deep learning neural net that we can employ for our purposes. U-NET’s output is a set of masks, where each mask contains a particular object of interest.

*Real U-NET output for the input image shown above. Table ordering numbers are segmented in a single mask*

Basically, for an input image that contains some objects, our deep neural net, when trained, should segment all objects of our interest and return a set of masks; each mask corresponds to an object of a particular class. If you’re interested in U-NET and how it works, I highly recommend reading this research paper.

Model & metrics selection

We won’t create a model from scratch. Since U-NET was introduced back in 2015, there are multiple implementations already available for us. Let’s take advantage of that.

I already imported the model, and I’m going to initialize an object of the model’s class. Let’s look at the documentation to see the variable parameters:

Signature:
unet(
    input_size=(512, 512, 1),
    start_neurons=64,
    net_depth=4,
    output_classes=1,
    dropout=False,
    bn_after_act=False,
    activation='mish',
    pretrained_weights=None,
)
Docstring:
Generates U-Net architecture model (refactored version)

Parameters:
    input_size : tuple (h, w, ch) for the input dimentions
    start_neurons : number of conv units in the first layer
    net_depth : number of convolution cascades including the middle cascade
    dropout : True -> use dropouts
    bn_after_act : True -> BatchNormalizations is placed after activation layers. Before otherwise
    activation : Type of activation layers: 'mish' - Mish-activation (default), 'elu' = ELU, 'lrelu' - LeakyReLU, 'relu' - ReLU
    pretrained_weights - None or path to weights-file
Return:
    U-net model
File:      ~/ml_basis/synthetic_items/models/unet.py
Type:      function

Docstring for the U-NET function, showing the set of parameters for initialization

As we can see from the docstring, there are eight parameters that define our future model. Only five of them affect the model’s architecture. The other three, input_size, output_classes and pretrained_weights, let us define the size of an input image, the number of output classes, and a path to weights from a previously trained model, respectively.

We’ll focus on the five parameters that shape the model’s architecture. This is where we’ll employ Keras Tuner to do hyperparameter tuning.

To find the best model architecture via hyperparameter tuning, we need to select a metric for model evaluation. To approach this question, let’s recap that U-NET performs binary classification for every image pixel, linking each pixel to a particular object class. Objects of interest in our problem domain are quite small compared to the image size. With that in mind, we should pick a metric that accounts for such imbalance in pixel classification. F1 score does particularly well here because it ignores true negatives, so the abundant, correctly classified background pixels can’t inflate it.
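To make the imbalance argument concrete, here is a minimal pure-Python sketch. It is not the project's metric code: the 100-pixel mask and the helper names are made up for illustration. It compares plain accuracy with F1 on a mask where only 5 % of the pixels belong to the object.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary pixel labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    # F1 = 2*TP / (2*TP + FP + FN); true negatives never enter the formula
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical flattened mask: 5 object pixels followed by 95 background pixels
y_true = [1] * 5 + [0] * 95

# A useless model that predicts "background" everywhere
all_background = [0] * 100

# A model that finds 3 of the 5 object pixels and raises 1 false alarm
partial = [1] * 3 + [0] * 2 + [1] * 1 + [0] * 94

for name, pred in [("all background", all_background), ("partial", partial)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f}, f1={f1_score(y_true, pred):.2f}")
```

The all-background prediction scores 0.95 accuracy while its F1 is 0, which is why accuracy alone would be misleading for small objects.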

Keras Tuner implementation

High-level overview of available tuners

How can we get the most out of our model using Keras Tuner? First of all, it’s important to say that there are multiple tuners in Keras. They use different algorithms for hyperparameter search. Here are the algorithms, with corresponding tuners in Keras:

kerastuner.tuners.hyperband.Hyperband for the HyperBand-based algorithm;
kerastuner.tuners.bayesian.BayesianOptimization for the Gaussian process-based algorithm;
kerastuner.tuners.randomsearch.RandomSearch for the random search tuner.

To give you an initial intuition of these methods, I can say that RandomSearch is the least efficient approach. It doesn’t learn from previously tested parameter combinations, and simply samples parameter combinations from a search space randomly.
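As a rough illustration of what random search does, the sketch below draws independent configurations from a search space. The ranges mirror the ones used in the model builder in this article; the sample_config helper and the seed are illustrative assumptions, and this is not Keras Tuner's internal code.

```python
import random

# Candidate values per hyperparameter (same ranges as the model builder in this article)
SEARCH_SPACE = {
    "start_neurons": list(range(16, 129, 16)),
    "net_depth": list(range(2, 7)),
    "dropout": [False, True],
    "bn_after_act": [False, True],
    "activation": ["mish", "elu", "lrelu"],
}


def sample_config(rng):
    """Draw one configuration; each draw ignores the results of earlier trials."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
trials = [sample_config(rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Nothing in sample_config looks at scores from previous trials, which is exactly the limitation described above.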

BayesianOptimization is similar to RandomSearch in that both sample a subset of hyperparameter combinations. The key difference is that BayesianOptimization doesn’t sample hyperparameter combinations randomly; it follows a probabilistic approach under the hood. This approach takes into account already tested combinations and uses this information to choose the next combination to test.

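Hyperband is an optimized version of RandomSearch in terms of search time and, therefore, resource allocation. It trains many randomly sampled configurations for a few epochs and spends the remaining budget only on the most promising ones.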
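Hyperband's savings come from successive halving: train many configurations briefly, keep the best fraction, and give the survivors a larger budget. The sketch below is a simplified single-bracket version, not Keras Tuner's implementation. The toy_score function is a made-up stand-in for a validation metric, and the epoch accounting ignores that survivors can resume from checkpoints.

```python
import random


def toy_score(config, epochs):
    """Hypothetical validation score: improves with epochs, capped by config quality."""
    return config["quality"] * (1 - 0.5 ** epochs)


def successive_halving(configs, min_epochs=1, factor=3):
    """Keep the top 1/factor of configs each round and multiply their budget by factor."""
    survivors = list(configs)
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        total_epochs += epochs * len(survivors)
        ranked = sorted(survivors, key=lambda c: toy_score(c, epochs), reverse=True)
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0], total_epochs


rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(27)]

best, budget = successive_halving(configs)
naive_budget = len(configs) * 9  # training every config for 9 epochs instead

print(f"best config id: {best['id']}")
print(f"epochs spent with successive halving: {budget}")
print(f"epochs spent training all configs for 9 epochs: {naive_budget}")
```

With 27 candidates and a factor of 3, the rounds train 27, 9 and 3 configurations for 1, 3 and 9 epochs respectively, so weak candidates cost very little.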

If you’re a curious person and want to learn more about Random Search, Bayesian Optimization and HyperBand, I definitely recommend this article.

Defining a search space and building a model

Keras Tuner provides an elegant way to define a model and a search space for the parameters that the tuner will use: you do it all by creating a model builder function. To show you how easy and convenient it is, here’s what the model builder function for our project looks like:

# building a model using a model builder function
def model_builder(hp):
    """
    Build model for hyperparameters tuning

    hp: HyperParameters class instance
    """

    # defining a set of hyperparameters for tuning and a range of values for each
    start_neurons = hp.Int(name = 'start_neurons', min_value = 16, max_value = 128, step = 16)
    net_depth = hp.Int(name = 'net_depth', min_value = 2, max_value = 6)
    dropout = hp.Boolean(name = 'dropout', default = False)
    bn_after_act = hp.Boolean(name = 'bn_after_act', default = False)
    activation = hp.Choice(name = 'activation', values = ['mish', 'elu', 'lrelu'], ordered = False)

    input_size = (544,544,3)
    target_labels = [str(i) for i in range(21)]

    # building a model
    model = u(input_size = input_size,
              start_neurons = start_neurons,
              net_depth = net_depth,
              output_classes = len(target_labels),
              dropout = dropout,
              bn_after_act = bn_after_act,
              activation = activation)

    # model compilation
    model.compile(optimizer = Adam(learning_rate = 1e-3),
                  loss = weighted_cross_entropy,
                  metrics = [f1, precision, recall, iou])

    return model

You might have noticed that within the model builder function, there are multiple methods that we used to define a search space for hyperparameters – hp.Int, hp.Boolean and hp.Choice.

There’s nothing wild in how these methods operate, just a straightforward definition for the values that can be used in the parameter search space. What really matters is how we, as engineers, select the methods for each parameter, and define optimal ranges/options for the values to be sampled.

For example, it’s very important to carefully consider the parameters within hp.Int, giving it meaningful minimum and maximum values (min_value and max_value), and a proper step. Your goal here is not to get overwhelmed with the number of options, which would cause the tuning process to take too much time and resources. It’s also crucial not to limit the search space in such a way that the tuner won’t even consider the best possible values. Given your expertise, knowledge and problem domain, think of what values might be the best to test out.
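To get a feel for how quickly a search space grows, you can count the combinations before launching a tuner. The sketch below does this for the ranges used in the model builder above. The int_range_size helper is a hypothetical convenience, not part of Keras Tuner, and it assumes the maximum value is included in the range.

```python
from math import prod


def int_range_size(min_value, max_value, step=1):
    """Number of values in an inclusive integer range with the given step."""
    return (max_value - min_value) // step + 1


space_sizes = {
    "start_neurons": int_range_size(16, 128, step=16),  # 16, 32, ..., 128
    "net_depth": int_range_size(2, 6),                  # 2, 3, 4, 5, 6
    "dropout": 2,                                       # True / False
    "bn_after_act": 2,                                  # True / False
    "activation": 3,                                    # mish, elu, lrelu
}
total_combinations = prod(space_sizes.values())
print(space_sizes)
print(f"total combinations: {total_combinations}")

# Halving the step for start_neurons almost doubles the number of combinations
finer = dict(space_sizes, start_neurons=int_range_size(16, 128, step=8))
finer_combinations = prod(finer.values())
print(f"with step = 8: {finer_combinations}")
```

Even this modest space has several hundred combinations, so each extra option or finer step should earn its place.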

Besides hp.Int, hp.Boolean and hp.Choice, there are also a few other options available for us to define values in the search space. You can get familiar with these methods via the documentation.

As a last step in creating a model builder function, we .compile our model before returning it.

Tuner initialization

By now you should already have a defined model builder function, and an idea of what algorithm you’d like to use for hyperparameter tuning. If you’re all set with the function and the algorithm, then you’re ready to initiate a tuner object.

For the image segmentation project we’re working on, I decided to stick with the Hyperband algorithm, so my initialization code looks like this:

# tuner initialization
tuner = kt.Hyperband(hypermodel = model_builder,
                     objective = kt.Objective("val_f1", direction="max"),
                     max_epochs = 20,
                     project_name = 'hyperband_tuner')

Four parameters are used during initialization:

hypermodel is a model builder function we defined previously;
objective is a metric that our model is trying to improve (maximize or minimize). As you might note from the above code snippet, I explicitly specify the name of the metric function of my choice (val_f1 which stands for f1 score for the validation dataset) and the direction it must go (max);
max_epochs is the maximum number of epochs any single model is trained for; Hyperband trains most candidates for far fewer. Official documentation suggests to “set this to a value slightly higher than the expected time to convergence for your largest Model”;
project_name is the name of the folder where all tuning-related results are stored. It is created inside the tuner’s directory argument, which defaults to the current working directory.

Tuning process launch

Launching a hypertuning process is similar to fitting a model in Keras/TensorFlow, except that we call the .search method on a tuner object instead of the regular .fit. Here is how I kicked off the tuning job for the project:

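tuner.search(x=train_dg,  # arguments are forwarded to model.fit, which expects x
             steps_per_epoch=batches_per_epoch,
             validation_data=valid_dg,
             # step counts should be integers; floor division drops a final partial batch
             validation_steps=len(glob(img_dir + '/*')) // valid_batch_size,
             # Hyperband sets epochs and initial_epoch per trial (see tuner/epochs in the results)
             epochs=50,
             shuffle=True,
             verbose=1,
             initial_epoch=0,
             callbacks=[ClearTrainingOutput()],
             use_multiprocessing=True,
             workers=6)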

The .search method and its parameters should already be familiar to you. The only thing I want to point out is the ClearTrainingOutput() callback, which simply clears the notebook output when a training run ends, that is, once per trial (it hooks on_train_end, not the end of each epoch). Here is the code for the ClearTrainingOutput callback:

# defining a callback that clears the notebook output when each trial finishes training
import IPython.display
import tensorflow as tf

class ClearTrainingOutput(tf.keras.callbacks.Callback):
    def on_train_end(self, logs=None):
        IPython.display.clear_output(wait = True)
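The validation_steps value passed to .search is derived from a sample count and a batch size. A small helper makes the rounding explicit so that the result is always an integer. The helper name, its drop_remainder flag and the sample numbers are hypothetical and not part of the original project.

```python
import math


def steps_for(num_samples, batch_size, drop_remainder=False):
    """Number of batches needed to cover num_samples once.

    drop_remainder=True skips the final, smaller batch (floor division);
    otherwise the partial batch counts as one more step (ceiling).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Hypothetical numbers: 1000 validation images, batches of 32
print(steps_for(1000, 32))                       # includes the partial batch
print(steps_for(1000, 32, drop_remainder=True))  # full batches only
```

Which rounding is right depends on whether your data generator yields the last partial batch or loops indefinitely.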

Getting tuning results

Here comes the most exciting part. Let’s see how well the tuner did. We’re curious what kind of results we were able to achieve, and which model configuration led to such performance.

To check the summary for the hypertuning job, we simply use .results_summary() on a tuner instance. Here is the complete code:

tuner.results_summary()

When run, the output for the tuner that we used within the project looks like this:

Results summary:

Results in hyperband_tuner
Showing 10 best trials
Objective(name='val_f1', direction='max')

Trial summary
Hyperparameters:
start_neurons: 32
net_depth: 5
dropout: False
bn_after_act: True
activation: mish
tuner/epochs: 15
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.9533569884300232

Trial summary
Hyperparameters:
start_neurons: 80
net_depth: 5
dropout: False
bn_after_act: True
activation: elu
tuner/epochs: 10
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.9258414387702942

Trial summary
Hyperparameters:
start_neurons: 128
net_depth: 3
dropout: True
bn_after_act: False
activation: elu
tuner/epochs: 18
tuner/initial_epoch: 3
tuner/bracket: 1
tuner/round: 1
tuner/trial_id: 5bc455f16fad434a9452c51a71c741b0
Score: 0.9170311570167542
..............
Trial summary
Hyperparameters:
start_neurons: 48
net_depth: 3
dropout: True
bn_after_act: False
activation: mish
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.5333467721939087

Trial summary
Hyperparameters:
start_neurons: 96
net_depth: 3
dropout: True
bn_after_act: True
activation: mish
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.5279057025909424

Trial summary
Hyperparameters:
start_neurons: 32
net_depth: 3
dropout: False
bn_after_act: False
activation: elu
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.5064906477928162

In the output above, I truncated the .results_summary() listing to six trials: the top three and the bottom three. The worst model reached an F1 score of only about 50.6 %, whereas the best model reached a remarkable 95.3 %. Keep in mind that the bottom three trials were stopped after 3 epochs (see tuner/epochs), so part of the gap reflects Hyperband’s early stopping of weak candidates and not the configuration alone.

Results & business impact

The gap between the best and the worst performer is almost 45 percentage points. From such a difference, we can conclude that:

Keras Tuner did an incredible job finding the best set of model parameters, nearly doubling the metric compared with the weakest trial;
We, as engineers, defined proper search space to sample from;
Keras Tuner works well not only for toy problems but, most importantly, for real-life projects.
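The comparison between the best and the worst trial can be recomputed from the scores in the summary above. The snippet below is only a sanity check on those two numbers, with the scores copied from the truncated output.

```python
best_score = 0.9533569884300232   # top trial in the summary
worst_score = 0.5064906477928162  # bottom trial in the summary

gap_points = (best_score - worst_score) * 100  # absolute gap, in percentage points
ratio = best_score / worst_score               # relative improvement

print(f"gap: {gap_points:.1f} percentage points")  # gap: 44.7 percentage points
print(f"ratio: {ratio:.2f}x")                      # ratio: 1.88x
```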

Let me also share the business impact that we got. A model that has been better optimized for a particular problem domain through hyperparameter tuning gave our service more stable and accurate long-run performance. Compared to our other similar but non-optimized services, we’ve seen a 15 % increase in the acceptance rate for our service as well as 25 % growth in classification confidence level.

Final remarks

In this article, we’ve gone over a complete Keras Tuner implementation for a real-life project, showing its potential to improve model performance by selecting the best set of parameters.

We learned the importance of search space definition, and now know how to set up our own tuner and kick off the tuning job.

Thanks for reading!
