How to Build a Lightweight Image Classifier in TensorFlow / Keras
Computer vision is a rapidly developing field where tremendous progress is being made, but there are still many challenges that computer vision engineers need to tackle.
First of all, their end models need to be robust and accurate.
Secondly, the final solution should be fast enough and, ideally, achieve near real-time performance.
Lastly, the model should occupy as few computational and memory resources as possible.
Luckily for us, there are many state-of-the-art algorithms to choose from. Some are best in terms of accuracy, some are the fastest, others are incredibly compact. The arsenal is indeed very rich, and computer vision engineers have a lot of options to consider.
Tasks that Computer Vision solves
Image classification

In this article, we’ll focus on creating image classifiers. Image classification is a basic task, but still one of the most important ones a computer vision engineer can tackle.
To classify an image means to assign it a class. The number of classes is limited to the number of image types you’d like to distinguish. For example, you might want to classify images based on vehicle types. Possible classes might be: a bike, a car, a bus, and a truck.
Alternatively, you might want a more detailed set, and decompose high-level classes into low-level sub-classes. With this approach, your class list might include a sport bike, a chopper, a scooter, and an ATV for bikes. The cars category could be further split into a hatchback, a pickup, a compact, a sports coupe, a crossover, and a van. A similar breakdown can be done for buses and trucks.
The final set of classes is determined by a computer vision engineer who knows the problem domain well, and is familiar with the dataset and available annotations.
Object detection
What if an image that you work with has more than one class associated with it? Continuing our previous example, a picture might contain both a bus and a scooter. You definitely don’t want to miss any object, and want to catch both. Here, object detection comes into play.
In computer vision, to detect an object means to localize it and assign a class to it. In simple words, we want to find an object in an image and identify it. As you can see, object detection includes image classification as a component, since classification is performed after the object is localized. This emphasizes the fact that image classification lies at the core of computer vision and therefore needs to be learned carefully.
Semantic segmentation

Lastly, let’s also briefly discuss semantic segmentation, where classification is performed for each pixel of an image: a pixel is classified as positive for an object if it belongs to that object, and the object is segmented by collecting all of its positive pixels. Once again, this highlights the importance of classification.
Of course, there are many other tasks in computer vision that we won’t touch today, but believe me, image classification is the most central one.
What makes image classification possible
Before we move on to the practical guide on how to create a lightweight classifier, let’s pause for a moment and revisit the theory behind the problem.
This decision is deliberate, and comes from my daily observation that more and more computer vision engineers tend to approach problems at a very high level. Usually, this means importing a pre-trained model, setting accuracy as a metric, and launching the training job. Thanks to the extensive work that went into the most common machine learning frameworks (scikit-learn, PyTorch, TensorFlow), this is possible nowadays.
Do such advances in these frameworks imply that there’s no need to dig into the mathematics under the hood of the algorithms? Absolutely not! Knowing the basics of what makes image classification possible is a superpower that you can, and definitely should, acquire.

When everything goes well, you might feel like there’s no need to go beyond a basic model import. But whenever you face a difficulty, that’s when understanding the algorithms will serve you well.
No need to become a mastermind, either. For deep learning, all you need to grasp is the concept of convolutional neural networks (CNNs). Once you learn this, you’ll be set to tackle most problems you might encounter when training a machine learning model. And believe me, there will be plenty of those if you’re on a machine learning career track.
Note: I won’t go over CNNs in detail in this article, I’d like to explain other things. But, I’ll share some resources that I found on the web, and think they’re quite useful for understanding CNNs. You can find them in the references section.
Overview of the best Convolutional Neural Nets

When the concept of CNNs is clear, you might wonder which CNNs available today perform the best. That’s what we’re going to talk about in this section.
Even though the first CNNs were introduced back in the 1980s, the real breakthrough happened in the early 2010s, when graphics processing units (GPUs) made large-scale training feasible.
To see how fast the progress went, it’s worth looking at the error rates achieved at the well-known annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Here’s how the error rate has changed over time:

From 2016 to now, there have been lots of advances regarding CNNs.
One architecture that deserves our attention today was introduced back in May 2019 by researchers from Google AI, in an article called “EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling” posted on the Google AI blog. The team of researchers invented a brand new CNN architecture called EfficientNet, which blew the mind of every computer vision engineer.
It turned out so good that it outperformed all previous state-of-the-art architectures in all respects: accuracy, speed and net size. Pretty impressive!
Today, I’d like to show you how straightforward it is to take advantage of Google’s invention and apply it to your own classification problem, as long as you work with the TensorFlow / Keras frameworks.
TensorFlow & Keras Frameworks in Machine Learning
Frameworks are essential in every information technology domain. Machine learning is no exception. There are several established players in the ML market which help us simplify the overall programming experience. PyTorch, scikit-learn, TensorFlow/Keras, MXNet and Caffe are just a few worth mentioning.
Today, I’d like to focus on TensorFlow/Keras. Not surprisingly, these two are among the most popular frameworks in the machine learning universe, largely because both TensorFlow and Keras provide rich capabilities for development. The two frameworks are closely related. Without digging too much into details, the key takeaway is that Keras is a high-level API that runs on top of the TensorFlow framework.
In relation to convolutional neural nets, Keras lets us import and build our models using the latest CNN architectures in the machine learning world. Check out the official documentation page, where you’ll find the complete library of pre-trained models available in Keras for fine-tuning.
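For instance, importing a few of these backbones looks like this (a minimal sketch assuming TensorFlow 2.x, where these models ship with tf.keras.applications):
# a few of the pre-trained backbones available in tf.keras.applications (TF 2.x)
from tensorflow.keras.applications import ResNet50, MobileNetV2, EfficientNetB0

# each one can be instantiated with ImageNet weights, ready for fine-tuning
backbone = EfficientNetB0(weights='imagenet', include_top=False)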
Image classifier creation: real-life project example
Project description
All right, let’s take advantage of the pre-trained models available in Keras, and solve a real-life computer vision problem. The project that we’re going to work with is intended to tackle an image orientation question. We need to create a model that can classify orientation for input images. There are four orientation options for input images:
- Normal,
- Rotated 90 degrees counterclockwise,
- Rotated 180 degrees counterclockwise,
- Rotated 270 degrees counterclockwise.
Given four orientations for the input images, we can conclude that there will be four classes that the model should be able to distinguish.
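To make the mapping concrete, here’s a small label map in the counterclockwise convention used throughout the project (the dictionary name and class strings are illustrative; the integer codes match the ones produced by the data generator below):
# class labels, counterclockwise: 0 = 0 degrees, 1 = 90, 2 = 180, 3 = 270
LABELS_MAP = {
    0: 'normal',
    1: 'rotated_90_ccw',
    2: 'rotated_180_ccw',
    3: 'rotated_270_ccw',
}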

The model isn’t supposed to detect the orientation of any image. The set of images that the model will work with is limited to only one type, as presented above.
The dataset contains around 11 thousand images, all checked and confirmed to be in normal orientation.
Data generator creation

Since all of the images in the dataset are in normal orientation, we need to rotate them before feeding them to the neural net, in order to ensure each class is represented. To do that, we use a custom image generator.
The custom generator works as follows:
- A single image from the dataset is read given a path to it;
- Image is rotated to one of four orientations. The orientation is selected randomly, with each orientation sampled with equal probability, which keeps the four input classes balanced;
- Rotated image is augmented using a predefined set of augmentation methods;
- Rotated and augmented image is preprocessed using a preprocessing function passed;
- Rotated and augmented images are stacked to create a batch of a given size;
- When the batch is formed, it’s yielded and fed into the neural net.
Here’s the complete code for the custom data generator used in the project:
import math
import os

import cv2
import numpy as np
import imgaug.augmenters as iaa
from tensorflow.keras.utils import Sequence


class DataGenerator(Sequence):
    """
    Generates rotated images for the net that detects orientation
    """

    def __init__(self,
                 data_folder,
                 target_samples,
                 preprocessing_f,
                 input_size,
                 batch_size,
                 shuffle,
                 aug):
        """
        Initialization
        :data_folder: path to folder with images (all images: both train and valid)
        :target_samples: an array of basenames for images to use within generator (e.g.: only those for train)
        :preprocessing_f: input preprocessing function
        :input_size: (tuple, (width, height) format) image size to be fed into the neural net
        :batch_size: (int) batch size at each iteration
        :shuffle: True to shuffle indices after each epoch
        :aug: True to augment input images
        """
        self.data_folder = data_folder
        self.target_samples = target_samples
        self.preprocessing_f = preprocessing_f
        self.input_size = input_size
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.aug = aug
        self.on_epoch_end()

    def __len__(self):
        """
        Denotes the number of batches per epoch
        :return: number of batches per epoch
        """
        return math.ceil(len(self.target_samples) / self.batch_size)

    def __getitem__(self, index):
        """
        Generates a batch of data (X and Y)
        """
        indices = self.indices[index * self.batch_size : (index + 1) * self.batch_size]

        # read the images for the current batch and convert them from BGR to RGB
        images_bn4batch = [self.target_samples[i] for i in indices]
        path2images4batch = [os.path.join(self.data_folder, im_bn) for im_bn in images_bn4batch]
        images4batch_bgr = [cv2.imread(path2image) for path2image in path2images4batch]
        images4batch_rgb = [cv2.cvtColor(bgr_im, cv2.COLOR_BGR2RGB) for bgr_im in images4batch_bgr]

        # augment (if enabled), then rotate to a random orientation
        if self.aug:
            angle4rotation = 2
            images4batch_aug = [self.__data_augmentation(im, angle4rotation) for im in images4batch_rgb]
        else:
            images4batch_aug = images4batch_rgb

        rotated_images, labels = self.__data_generation(images4batch_aug)

        # resize and preprocess the rotated images
        images4batch_resized = [cv2.resize(rotated_im, self.input_size) for rotated_im in rotated_images]

        if self.preprocessing_f:
            prep_images4batch = [self.preprocessing_f(resized_im) for resized_im in images4batch_resized]
        else:
            prep_images4batch = images4batch_resized

        images4yielding = np.array(prep_images4batch)

        return images4yielding, labels

    def on_epoch_end(self):
        """
        Updates indices after each epoch
        """
        self.indices = np.arange(len(self.target_samples))
        if self.shuffle:
            np.random.shuffle(self.indices)  # in-place shuffling

    def __data_generation(self, images):
        """
        Applies random image rotation and generates labels.
        Labels map: counterclockwise direction! 0 = 0, 90 = 1, 180 = 2, 270 = 3
        :return: rotated_images, labels
        """
        labels = np.random.choice([0, 1, 2, 3], size=len(images), p=[0.25] * 4)
        # np.rot90 returns a view; ascontiguousarray makes it safe to pass to OpenCV later
        rotated_images = [np.ascontiguousarray(np.rot90(im, angle)) for im, angle in zip(images, labels)]
        return rotated_images, labels

    def __data_augmentation(self, image, max_rot_angle=2):
        """
        Applies data augmentation
        :max_rot_angle: maximum angle that can be selected for image rotation
        :return: augmented image
        """
        rotation_options = np.arange(-1 * max_rot_angle, max_rot_angle + 1, 1)
        angle4rotation = int(np.random.choice(rotation_options))

        # wraps an augmenter so that it's applied to ~50% of the images
        sometimes = lambda aug: iaa.Sometimes(0.5, aug)

        seq = iaa.Sequential(
            [
                # brightness / color jitter
                iaa.OneOf([
                    iaa.Add((-15, 15), per_channel=False),
                    iaa.Multiply((0.8, 1.2)),
                    iaa.MultiplyHueAndSaturation((0.8, 1.1))
                ]),
                # additive noise
                iaa.OneOf([
                    iaa.AdditiveGaussianNoise(loc=0, scale=(0.02, 0.05 * 255), per_channel=0.5),
                    iaa.AdditiveLaplaceNoise(loc=0, scale=(0.02, 0.05 * 255), per_channel=0.5),
                    iaa.AdditivePoissonNoise(lam=(8, 16), per_channel=0.5),
                ]),
                # salt-and-pepper style pixel corruption
                iaa.OneOf([
                    iaa.Dropout(p=0.005, per_channel=False),
                    iaa.Pepper(p=0.005),
                    iaa.Salt(p=0.01)
                ]),
                sometimes(
                    iaa.FrequencyNoiseAlpha(
                        exponent=(-1, 2),
                        first=iaa.Multiply((0.9, 1.1), per_channel=False),
                        second=iaa.ContrastNormalization((0.8, 1.2)))
                ),
                # blurring
                iaa.OneOf([
                    iaa.GaussianBlur((0.5, 1)),  # blur images with a sigma between 0.5 and 1.0
                    iaa.AverageBlur(k=(3, 5)),  # blur images using local means with kernel sizes between 3 and 5
                    iaa.MotionBlur(k=(5, 7), angle=(0, 359)),  # motion blur with kernel sizes between 5 and 7
                    iaa.MedianBlur(k=(5, 7)),  # blur images using local medians with kernel sizes between 5 and 7
                ]),
                sometimes(iaa.JpegCompression((60, 80))),
                # contrast adjustments
                iaa.OneOf([
                    iaa.GammaContrast((0.7, 1.3)),
                    iaa.GammaContrast((0.7, 1.3), per_channel=True),
                    iaa.SigmoidContrast(gain=(5, 8)),
                    iaa.LogContrast((0.6, 1)),
                    iaa.LinearContrast((0.6, 1.4))
                ]),
                # slight (+/- max_rot_angle degrees) rotation that doesn't change the orientation class
                sometimes(
                    iaa.Affine(rotate=angle4rotation, mode='edge')
                )
            ])

        img_aug = seq(images=[image])[0]
        return img_aug
Image augmentation should be considered carefully. For instance, our project assumes no geometrical transformations. Crops, flips and warping should be excluded from the possible options for two reasons:
- Such augmentation can affect an input image’s orientation; as a result, we might get invalid model predictions.
- Such transformations don’t reflect the images that the model is intended to process in the future. That’s not always the case, but for this particular project the model will receive images that have no geometrical alterations. That is why there’s no need to apply these variations synthetically when training a model.
Augmentation types that work mostly on a pixel-wise level, on the other hand, should be applied. These imply slight changes in, for example, color and contrast. Blurring, blending and pooling are also applicable. To learn more about various augmentation techniques, take a look at the imgaug GitHub page.
It’s worth mentioning that two data generators are initialized for the project:
- the first one is intended to generate training data,
- the second one will be used to produce validation data.
The data is split between training and validation sets at an 80:20 ratio: 80% of the data is dedicated to training the model, and 20% is used for model evaluation.
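Here’s a minimal sketch of how the split and the two generators might be set up. The variable names (all_basenames, train_basenames, valid_basenames), the folder path, and the hyperparameter values are illustrative assumptions; preprocess_input is the EfficientNet preprocessing function imported in the next section:
import os
from sklearn.model_selection import train_test_split

# hypothetical folder with all ~11k normally-oriented images
all_basenames = os.listdir('data/images')

# 80/20 split over image basenames
train_basenames, valid_basenames = train_test_split(all_basenames, test_size=0.2, random_state=42)

train_generator = DataGenerator(data_folder='data/images',
                                target_samples=train_basenames,
                                preprocessing_f=preprocess_input,
                                input_size=(128, 128),
                                batch_size=32,
                                shuffle=True,
                                aug=True)

validation_generator = DataGenerator(data_folder='data/images',
                                     target_samples=valid_basenames,
                                     preprocessing_f=preprocess_input,
                                     input_size=(128, 128),
                                     batch_size=32,
                                     shuffle=False,
                                     aug=False)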
Neural Net architecture design
We will start designing our future image classifier by importing a backbone to work with. Keras gives us access to its model Zoo with multiple CNNs available for import. Our goal is to create a lightweight classifier, so we definitely should consider EfficientNet, which is highly efficient and accurate.
To import EfficientNet, first you have to decide which depth to go with. EfficientNet has 8 depth levels, starting from B0 (the baseline), and finishing up with the deepest B7. EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on the ImageNet dataset.

Performance metrics shown on ImageNet dataset.
To select the proper depth level, you should always consider the complexity of your problem. From my personal experience, going with either B1 or B2 is a good starting point, since they’re good enough to solve most real-life computer vision problems. If you’d like to learn more about the architecture design of EfficientNet, consider reading this article by Armughan Shahid.
Our project doesn’t involve a high level of complexity. That’s why B1 is a good depth for the EfficientNet backbone. Let’s import EfficientNet B1 and initialize an object of this class.
from tensorflow.keras.applications.efficientnet import EfficientNetB1, preprocess_input

backbone = EfficientNetB1(include_top=False,
                          input_shape=(128, 128, 3),
                          pooling='avg')
Look at the set of parameters used for initialization. There are plenty of them, but I’d like to focus on only a few:
- include_top is a boolean that specifies whether to include the fully-connected layer at the top of the network. For a custom image classifier, we’ll need to create our own top part of the network to reflect the number of classes we have;
- input_shape is a tuple that indicates the input image dimensions: (image height, image width, number of channels). For B1, I decided to go with quite a small input image size – (128, 128, 3) – but keep in mind that the deeper your convolutional neural net, the larger the input image size should be. You definitely don’t want to move from B1 to B3 and keep the input image size the same;
- pooling specifies the pooling mode used for feature extraction. Pooling helps us downsample the features in the feature maps. The selected ‘avg’ stands for average pooling.
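With pooling='avg', the backbone outputs one pooled feature vector per image instead of a stack of feature maps. A quick sanity check (for B1 the feature dimension should come out as 1280, though it’s worth verifying on your own setup):
print(backbone.output_shape)  # expected: (None, 1280) – one pooled feature vector per image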
Since we didn’t include the top part, we have to explicitly define it. In particular, we should determine:
- Number of fully-connected layers in the top part;
- Number of neurons in each fully-connected layer;
- Activation function used after each layer;
- Methods used to make artificial neural networks faster, more stable and less prone to overfitting (e.g.: regularization, normalization, etc);
- Last layer design that reflects the classification problem that we’re trying to tackle.
Here’s the architecture for the top part that I selected for this project:
from tensorflow.keras.layers import Dense, BatchNormalization, LeakyReLU, Softmax
from tensorflow.keras.models import Sequential
n_classes = 4
dense_count = 256
model = Sequential()
model.add(backbone)
model.add(Dense(dense_count))
model.add(LeakyReLU())
model.add(BatchNormalization())
model.add(Dense(n_classes))
model.add(Softmax())
As you can see, the model is designed using a Sequential class instance. As a first step, the model backbone is added. The backbone is the EfficientNet CNN instance that we imported and initialized earlier.
The top part for the model has two fully-connected (dense) layers. The first one has 256 neurons in it and a LeakyReLU activation function. Batch normalization ensures speed and stability through the normalization of layer inputs.
The second layer has a number of neurons equal to the number of classes. Softmax activation is used to make a prediction.
This is the basic design for the top part that I use for most of the baseline models I start with. As a lifehack, I suggest training your model with the top part that has no intermediate fully-connected layers, just the final fully-connected layer that’s used for prediction making.
At first, it might seem like a bad idea, since it may cause a drop in model performance due to reduced model capacity. Surprisingly, my personal experience has proven that that’s not the case with EfficientNet. Dropping intermediate layers in the top part doesn’t cause performance decline. Moreover, in 7 out of 10 cases, it results in better performance while reducing final model size. Pretty inspiring, isn’t it? Give it a shot, and you’ll see how cool EfficientNet is.
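For reference, here’s a minimal sketch of that stripped-down variant (same backbone, no intermediate dense layer; slim_model is just an illustrative name):
# top part with no intermediate fully-connected layers: backbone + prediction layer only
slim_model = Sequential()
slim_model.add(backbone)
slim_model.add(Dense(n_classes))
slim_model.add(Softmax())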

Model training
By now you should have a generator and a model design to proceed with. Now, let’s train the model and get the result.
I usually use a two-stage approach to model training. It’s especially handy for fine-tuning, when I don’t want to drastically change the original weights as a new dataset is introduced for transfer learning. A high learning rate would cause significant shifts in the weights of the very first model layers, which are responsible for simple, low-level feature extraction and are already well trained on the ImageNet dataset.
In a two-stage approach, we freeze almost the entire backbone, leaving only the last layers to be trainable. With such a freeze, the model is trained for a few epochs (I usually go with no more than 5-10 epochs if my training dataset is big enough). When the weights for these last layers are trained and further training doesn’t improve model performance, we then unfreeze the entire backbone, giving the model a chance to make slight changes for the weights in the previously frozen layers and, therefore, get better results while keeping the training process stable.
Here’s what the two-stage approach looks like in code:
1. The model is frozen

The backbone of our model has 7 blocks. For the first stage, the first four blocks are frozen, so all layers in these blocks are not trainable. The fifth, sixth and seventh blocks, as well as the top part of the model, are not frozen and will be trained during the first stage. Here’s how the model freezing is performed in code:
block_to_unfreeze_from = 5
trainable_flag = False

for layer in model.layers[0].layers:
    # keep batch normalization layers trainable regardless of the freeze
    if layer.name.find('bn') != -1:
        layer.trainable = True
    else:
        layer.trainable = trainable_flag

    # once a layer of the block to unfreeze from is reached, unfreeze everything from here on
    if layer.name.find(f'block{block_to_unfreeze_from}') != -1:
        trainable_flag = True
        layer.trainable = trainable_flag

# sanity check: print the trainability of every backbone layer
for layer in model.layers[0].layers:
    print(layer.name, layer.trainable)
To verify the result of freezing, a simple print statement is used to inspect the trainability of each layer.
2. The model is compiled
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
At the first stage, I suggest compiling the model with a slightly higher learning rate; for instance, 1e-3 is a good option to consider. Note the ‘sparse’ loss function and metric used for compilation: since our generator yields integer class labels (0 to 3) rather than one-hot vectors, the sparse versions are the ones that treat our labels properly.
The model itself outputs a set of probabilities, one per class, which always sum up to one thanks to the Softmax activation at the very last layer. To determine the predicted class, select the index with the highest probability: that index is the class predicted by the model.
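For example (a sketch; sample_batch stands for a hypothetical preprocessed batch of images, and np is the numpy import from the generator code above):
probabilities = model.predict(sample_batch)           # shape: (batch_size, 4), each row sums to 1
predicted_classes = np.argmax(probabilities, axis=1)  # index of the highest probability per image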
If you’re curious how sparse and non-sparse metrics differ, here’s an example for accuracy:
- categorical_accuracy checks to see if the index of the maximal true value is equal to the index of the maximal predicted value.
- sparse_categorical_accuracy checks to see if the maximal true value is equal to the index of the maximal predicted value.
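A tiny numeric illustration of the difference:
import numpy as np

y_pred = np.array([0.1, 0.6, 0.2, 0.1])  # model output for one image

# sparse_categorical_accuracy: integer label vs. argmax of the prediction
y_true_sparse = 1
sparse_match = (y_true_sparse == np.argmax(y_pred))  # True

# categorical_accuracy: argmax of a one-hot label vs. argmax of the prediction
y_true_onehot = np.array([0, 1, 0, 0])
categorical_match = (np.argmax(y_true_onehot) == np.argmax(y_pred))  # True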
3. The training job is launched using a standard .fit method in Tensorflow / Keras:
from tensorflow.keras.callbacks import TensorBoard

logdir = os.path.join(dir4saving, 'logs')
os.makedirs(logdir, exist_ok=True)

tbCallBack = TensorBoard(log_dir=logdir,
                         histogram_freq=0,
                         write_graph=False,
                         write_images=False)

first_stage_n = 15

model.fit(train_generator,
          steps_per_epoch=training_steps_per_epoch,
          epochs=first_stage_n,
          validation_data=validation_generator,
          validation_steps=validation_steps_per_epoch,
          callbacks=[tbCallBack],
          use_multiprocessing=True,
          workers=16)
- training_steps_per_epoch and validation_steps_per_epoch are just two integers calculated as follows:
– training_steps_per_epoch = int(len(train_set) / batch_train)
– validation_steps_per_epoch = int(len(validation_set) / batch_validation)
- tbCallBack is a callback for TensorBoard
- first_stage_n is the number of epochs to train for during the first stage
- use_multiprocessing and workers are two parameters that enable multiprocessing and set the number of CPU workers to be used
By the end of the first stage, model performance reached a good level: the main metric (sparse categorical accuracy) hit 80%. Let’s now unfreeze the blocks that were previously frozen, recompile the model, and launch the training job for the second stage.
4. Unfreezing the model, preparing it for the second training stage:
Here’s how unfreezing is done:
# unfreezing all layers, both in the model head and in the CNN backbone
for layer in model.layers:
    layer.trainable = True

for layer in model.layers[0].layers:
    layer.trainable = True
5. The model is recompiled to apply the changes in block freezing
The compilation is similar to what we did during the first stage. The only difference is the learning rate, which should be reduced at the second stage. For this project, I decreased it from 1e-3 to 1e-4.
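In code, the second-stage recompilation is a one-liner change:
# same loss and metric, reduced learning rate for the second stage
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])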
6. Second stage training is launched using a standard .fit method in Tensorflow / Keras
Again, the training job kick-off isn’t much different from what we had during the first stage, except for the number of used callbacks. There are plenty of them available to us. I highly recommend reading this article to get familiar with the options available in TensorFlow / Keras.
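As a sketch, the second-stage launch might look like this, adding, for example, early stopping and checkpointing on top of TensorBoard (second_stage_n, the patience value, and the checkpoint path are illustrative assumptions):
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(monitor='val_sparse_categorical_accuracy',
                           patience=5,
                           restore_best_weights=True)
checkpoint = ModelCheckpoint(filepath=os.path.join(dir4saving, 'best_model.h5'),
                             monitor='val_sparse_categorical_accuracy',
                             save_best_only=True)

second_stage_n = 30
model.fit(train_generator,
          steps_per_epoch=training_steps_per_epoch,
          epochs=second_stage_n,
          validation_data=validation_generator,
          validation_steps=validation_steps_per_epoch,
          callbacks=[tbCallBack, early_stop, checkpoint],
          use_multiprocessing=True,
          workers=16)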
By the end of the second stage, model performance reached around 99.97% – not bad! The entire training ran on a single GeForce RTX 2080 GPU and took around 4-5 hours. The final model checkpoint is indeed very lightweight, taking up only 85 MB of memory.
Conclusions
We’ve gone over a real-life computer vision project, creating an image classifier in TensorFlow / Keras. The CNN that we employed as a backbone is the cutting-edge EfficientNet architecture from Google, and the model we ended up with is incredibly light.
To compare, I experimented a lot with other CNN backbones. The best results I got were with ResNet50: the final model size was around 550 MB, and the best achieved accuracy was 95.1%. Even if we decided to go deeper with EfficientNet, moving from B0 to B1, B2 or even B3, all of these models would still be much lighter than ResNet.
Remember, you can consider dropping the top part of the model and get an even lighter architecture that still performs at a pretty decent level. With no top part and with EfficientNet B0, I was able to achieve 99.12% accuracy, and the final model size was quite small: just 39 MB. Big round of applause to EfficientNet!

References
- CS231n course from Stanford University, which explains convolutional neural networks for visual recognition;
- Convolutional Neural Networks course by Andrew Ng, a well known machine learning instructor who knows how to explain complex concepts in simple terms;
- ML Practicum on Image Classification from Google, a one-pager that provides the most essential things that you need to know to start working with CNNs. It’s a great source of information for those who lack the time, but still need to dive into the topic.