TensorFlow Object Detection API: Best Practices for Training, Evaluation & Deployment

24th August, 2023

This article is the second part of a series where you learn an end-to-end workflow for TensorFlow Object Detection and its API. In the first article, you learned how to create a custom object detector from scratch, but there are still plenty of things that need your attention to become truly proficient.

We’ll explore topics that are just as important as the model creation process we’ve already gone through. Here are some of the questions we’ll answer:

  • How to evaluate my model and get an estimate of its performance?
  • What are the tools that I can use to track model performance and compare results across multiple experiments?
  • How can I export my model to use it in the inference mode?
  • Is there a way to boost model performance even more?

Model evaluation

It’s interesting to know how our model is going to perform in the wild. To get an idea of how well our model does on real data, we need to do a few things:

  1. Select a set of evaluation metrics,
  2. Get separate datasets to validate and test your model,
  3. Launch the evaluation process with an appropriate set of parameters.



Evaluation, Step 1: Metrics

Let’s start with a set of evaluation metrics. You might remember that in the first article, we installed a dependency called COCO API.

We needed this to access a set of useful metrics for object detection: mean average precision and recall. In case you don’t remember these metrics, you should definitely read about them. Computer vision engineers use them quite a lot.

To use mean average precision and recall, you should configure your pipeline.config file. The metrics_set parameter in the eval_config block should be set to “coco_detection_metrics”.

This is a default option for this parameter, so you most likely already have it there. Check if your eval_config lines look like this:

This is a place within the pipeline.config file where we specify metrics we want to use for evaluation

When we employ “coco_detection_metrics” set within metrics_set, this is what becomes available:

Mean average precision (mAP) shown as a plot after we enable it for model validation. Note that mAP is calculated for different IOU values. Average recall is not shown but also becomes available.

Your choice for the set of metrics is not limited to precision and recall. There are some other options available in the TensorFlow API. Take your time to choose the option you want for tracking your particular model.
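Since mAP is reported at several IoU thresholds (the COCO metrics include values at 0.5 and 0.75, for example), it helps to see what IoU, intersection over union, actually measures. The sketch below is a minimal pure-Python illustration, not the COCO API implementation; the function name iou and the [y1, x1, y2, x2] box layout are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [y1, x1, y2, x2]."""
    ya1, xa1, ya2, xa2 = box_a
    yb1, xb1, yb2, xb2 = box_b

    # Overlapping rectangle (empty if the boxes do not intersect).
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    intersection = inter_h * inter_w

    area_a = (ya2 - ya1) * (xa2 - xa1)
    area_b = (yb2 - yb1) * (xb2 - xb1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Made-up boxes for illustration.
ground_truth = [0.0, 0.0, 10.0, 10.0]
prediction = [0.0, 2.0, 10.0, 12.0]
overlap = iou(ground_truth, prediction)

# The same prediction can count as a hit at a loose threshold
# and as a miss at a strict one.
for threshold in (0.5, 0.75):
    print(threshold, overlap >= threshold)
```

With these boxes the overlap is two thirds, so the prediction passes the 0.5 threshold but fails the 0.75 one. That is why a model can look good on one mAP curve and noticeably worse on another.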

Evaluation, Step 2: Datasets

If you carefully followed the instructions in the first article, then dataset preparation should sound familiar. 
As a reminder, we prepared two files (validation.record and test.record) needed for model evaluation and placed them into Tensorflow/workspace/data. If your data folder has the same number of files as below, then you’re all set to proceed to the next step!

└─ cocoapi/
└─ ...
└─ workspace/
   └─ data/
      ├─ train.record # dataset that was used to train our model
      ├─ validation.record # dataset that we’ll use for model validation
      ├─ test.record # dataset that we’ll use to test our model
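If you prefer to check the folder programmatically, a small script can do it. This is only a convenience sketch: the helper name missing_records is hypothetical, and the path passed at the bottom assumes the layout shown above.

```python
from pathlib import Path

EXPECTED_RECORDS = ("train.record", "validation.record", "test.record")


def missing_records(data_dir):
    """Return the expected .record files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_RECORDS
            if not (data_dir / name).is_file()]


# Adjust the path to wherever your workspace lives.
print(missing_records("Tensorflow/workspace/data"))
```

An empty list means all three datasets are in place.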

In case you’re missing one of the .record files in your data folder but still want to do the evaluation, here is something to consider:

  • validation.record is needed to evaluate your model during training;
  • test.record is needed to check end-model performance when it’s already been trained.

The traditional approach to machine learning uses three separate sets: one for training, one for validation, and one for testing. I urge you to follow it, but in case you have a valid reason to avoid one of these sets, then prepare only those .record files that are relevant for your purposes.

Evaluation, Step 3: Process Launch

As mentioned before, you can do the model evaluation at two different points in time: during training or after the model’s been trained.

Model evaluation during training is called validation. The TensorFlow Object Detection API’s validation job is treated as an independent process that should be launched in parallel with the training job. 

When launched in parallel, the validation job will wait for checkpoints that the training job generates during model training and use them one by one to validate the model on a separate dataset. 

validation.record represents a separate dataset that the model uses for validation. The metrics_set parameter within the eval_config block defines a set of metrics for evaluation.

In order to start the validation job, open a new Terminal window, navigate to Tensorflow/workspace/, and launch the following command:

python model_main_tf2.py \
  --pipeline_config_path=<path to your config file> \
  --model_dir=<path to a directory with your model> \
  --checkpoint_dir=<path to a directory with checkpoints> \
  --num_workers=<int for the number of workers to use>


  • <path to your config file> is a path to the config file which was used for training the model you want to evaluate. Should be a config file from ./models/<folder with the model of your choice>/v1/,
  • <path to a directory with your model> is a path to a directory where the evaluation job will write logs (evaluation results). My recommendation is to use the following path: ./models/<folder with the model of your choice>/v1/ . Given this, your evaluation results will be placed next to training logs,
  • <path to a directory with checkpoints> is a directory where your training job writes checkpoints. Should also be the following: ./models/<folder with the model of your choice>/v1/,
  • <int for the number of workers to use> if you have a multi-core CPU, this parameter defines the number of cores that can be used for the evaluation job. Keep in mind that your training job has already occupied the number of cores that you allocated for it. Given that, set the number of cores for evaluation appropriately.

Right after executing the above command, your evaluation job will begin. Similarly to what we did for the training job, if you want to do the evaluation on a GPU, enable it by executing the following command before launching the evaluation job:

  export CUDA_VISIBLE_DEVICES=<GPU number>

Where <GPU number> defines an order number of the GPU you want to use. Note that the order count starts from zero. For validation on a CPU, use -1 like in the command below:

  export CUDA_VISIBLE_DEVICES=-1

Model performance tracking

Intro to model performance tracking

In machine learning, it’s hard to tell in advance which model will give you the best results for a given task. Developers often test multiple hypotheses using a trial and error approach. 

You can examine different model architectures or stick to one architecture but play around with different parameter setups. Each and every configuration should be tested via a separate training job launch, so a tool for tracking and comparing multiple experiments comes in handy. 

The TensorFlow API writes model performance-related logs and optimizer state using the tfevents format. There are two main tfevents files you want to keep track of: training-related and evaluation-related.

The training tfevent is limited to loss and learning rate tracking. It also records the number of steps per second, so you can see how fast your training job is going.

Such experiment metadata is automatically logged when you launch a training job for your model. Logs are stored at Tensorflow/workspace/models/<folder with the model of your choice>/v1/train. When visualized using TensorBoard (which we’ll talk about in a second), this is what it looks like:

Training tf-event (logs) visualized using Tensorboard

Note that you’re able to see the loss on a component level breakdown (separately for classification and localization), and also as a total value. It becomes especially useful when you face a problem and want to examine your model to find the root cause of it.

As said previously, we can also keep track of how the learning rate changes over time and how many steps per second your training job does.

The evaluation tfevent, similar to training tfevent, also consists of a loss component with the same breakdown. Besides that, it keeps track of evaluation metrics that we talked about before.

Tracking tools


There are multiple tools that can help you track and compare model-related logs. The one that is already built into the TensorFlow API is TensorBoard.

Tensorboard is relatively easy to use. In order to launch your TensorBoard, open a Terminal window and navigate to Tensorflow/workspace/models/<folder with the model of your choice>/ directory. 

When there, use this command to launch Tensorboard:

tensorboard --logdir=<path to a directory with your experiment / experiments>

You can pass to --logdir a path to a folder with logs for multiple experiments in it (e.g.: Tensorflow/workspace/models/).

You can also limit accessible data by providing a path to logs for a particular experiment (e.g.: Tensorflow/workspace/models/<folder with the model of your choice>/). 

In any case, TensorBoard will automatically find all directories with logs and use this data to build plots. You can learn more about what TensorBoard can do in the official guide.

Neptune is an alternative tracking tool for your consideration. Compared to TensorBoard, it provides a broader range of features. Here is what I found especially convenient:

  • Neptune is fully compatible with the tfevent (TensorBoard) format. All that you need to do is launch a single command line in your Terminal window,
  • You can import only those experiments that you consider important. It lets you filter out those launches you want to exclude from the comparison. Given that, your final dashboard will stay neat and won’t be overloaded with too many experiments,
  • Your work (notebooks, experiments results) is shareable with others in a very easy way (just send a link),
  • You can track anything you want. It becomes especially handy when you also want to keep track of model parameters and/or its artifacts. Your hardware utilization is also visible, and all of that is available to you in a single place.



Model export

All right, your model has now been trained, you are satisfied with its performance, and now want to use it for inference. Let me show you how you can do that. It’s going to be a two-step process:

1. First step – model export. To do that, you should:

  • Copy the exporting script (exporter_main_v2.py) from Tensorflow/models/research/object_detection/ and paste it into Tensorflow/workspace,
  • Within Tensorflow/workspace, create a new folder called exported_models. This is going to be a place where you will place all of your exported models,
  • Within Tensorflow/workspace/exported_models create a subfolder where a particular exported model will be stored. Give this folder the same name as <folder with the model of your choice> that you used within Tensorflow/workspace/models/<folder with the model of your choice>,

Open a new Terminal window, make Tensorflow/workspace your current working directory, and launch the following command:

python exporter_main_v2.py \
  --pipeline_config_path=<path to a config file> \
  --trained_checkpoint_dir=<path to a directory with your trained model> \
  --output_directory=<path to a directory where to export a model>


  • <path to a config file> is a path to the config file for the model you want to export. Should be a config file from ./models/<folder with the model of your choice>/v1/
  • <path to a directory with your trained model> is a path to a directory where model checkpoints were placed during training. Should also be the following: ./models/<folder with the model of your choice>/v1/  
  • <path to a directory where to export a model> is a path where an exported model will be saved. Should be: ./exported_models/<folder with the model of your choice> 

2. Second step – running your model in inference mode.

For your convenience, I made a Jupyter notebook that has all the code you need to do inference. Your goal is to go over it and fill in all missing values marked as TODOs.

In the Jupyter notebook, you’ll find two functions for inference that you can use depending on your goal: inference_with_plot and inference_as_raw_output.

Use inference_with_plot when you just want to visualize your model output as bounding boxes plotted over the objects for your input image. Function output, in this case, is going to be a plot like the one below:

Example of an output for inference_with_plot function. | Image source: wikipedia object detection page

Alternatively, you can use inference_as_raw_output which, instead of plotting, returns a dict that has 3 keys:

  • Under detection_classes key you have an array with all classes which were detected. Classes are returned as integers,
  • Use detection_scores (array) to see scores for detection confidence for each detected class,
  • Lastly, detection_boxes is an array with coordinates for bounding boxes for each detected object. Each box has the following format – [y1, x1, y2, x2]. Top left corner is defined as y1 and x1, whereas bottom right is defined as y2 and x2.
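To give a feeling for how such a raw output could be consumed downstream, here is a minimal sketch. It assumes the dict has exactly the three keys listed above and that box coordinates are normalized to the 0..1 range (check this against your own notebook output). The helper name to_pixel_detections and the min_score cut-off are hypothetical, introduced only for this example.

```python
def to_pixel_detections(raw_output, image_height, image_width, min_score=0.5):
    """Filter detections by score and convert boxes to pixel coordinates.

    Assumes boxes are [y1, x1, y2, x2] normalized to the 0..1 range.
    """
    detections = []
    for cls, score, box in zip(raw_output["detection_classes"],
                               raw_output["detection_scores"],
                               raw_output["detection_boxes"]):
        if score < min_score:
            continue
        y1, x1, y2, x2 = box
        detections.append({
            "class_id": int(cls),
            "score": float(score),
            "box_px": (round(y1 * image_height), round(x1 * image_width),
                       round(y2 * image_height), round(x2 * image_width)),
        })
    return detections


# Hand-written stand-in for the dict returned by inference_as_raw_output.
example = {
    "detection_classes": [1, 2],
    "detection_scores": [0.9, 0.3],
    "detection_boxes": [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]],
}
print(to_pixel_detections(example, image_height=480, image_width=640))
```

Only the first detection survives the score filter here, and its box comes back in pixels, ready to be cropped or passed to the next service.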

Opportunities for model improvement

In this part, I want to share with you some cool methods that can boost your model performance. My goal here is to give you a high-level overview of what’s available in the TensorFlow API and its arsenal. I will also give you an intuition for implementing these methods. Let’s get started!

Image preprocessing

You should know what you feed into your model. Image preprocessing is a crucial step in any computer vision application. 

TensorFlow does the image normalization step (or standardization step, if you prefer; the differences between normalization and standardization are nicely described here) under the hood, and we can’t influence that. But we can control how the image is resized and which size it’s resized to.
To get a better sense of how the TensorFlow API does this, let’s have a look at a code snippet from pipeline.config for the EfficientDet D-1 model:

Code snippet within pipeline.config file that defines image resizing step in EfficientDet D-1 model. 

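The default method responsible for image resizing for EfficientDet D-1 is keep_aspect_ratio_resizer.

This method tries to scale the shorter side of an image to min_dimension, but if that would push the longer side past max_dimension, it scales the longer side to max_dimension instead. With both parameters set to 640, as in the above example, any non-square image ends up with its longer side at 640 pixels, and the other side is scaled by the same factor to preserve the original aspect ratio.

pad_to_max_dimension set to true pads the resized image up to max_dimension on both sides, so every input gets the same square shape while the image content keeps its original aspect ratio.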
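To make the arithmetic concrete, the following sketch mirrors the resizing rule as I understand it: scale by the shorter side first, and fall back to the longer side if the result would be too large. It is not the library code. The function names keep_aspect_ratio_size and padded_fraction are hypothetical, and the 640 defaults are taken from the text above.

```python
def keep_aspect_ratio_size(height, width, min_dimension=640, max_dimension=640):
    """Return (new_height, new_width) before any padding is applied."""
    # First try to bring the shorter side to min_dimension.
    scale = min_dimension / min(height, width)
    new_size = (round(height * scale), round(width * scale))

    # If that pushes the longer side past max_dimension, scale by the
    # longer side instead.
    if max(new_size) > max_dimension:
        scale = max_dimension / max(height, width)
        new_size = (round(height * scale), round(width * scale))
    return new_size


def padded_fraction(height, width, max_dimension=640):
    """Share of a max_dimension x max_dimension canvas that is padding."""
    new_h, new_w = keep_aspect_ratio_size(height, width,
                                          max_dimension, max_dimension)
    return 1.0 - (new_h * new_w) / (max_dimension * max_dimension)


print(keep_aspect_ratio_size(1080, 1920))  # (360, 640)
print(padded_fraction(1080, 1920))         # 0.4375
```

For a 1080x1920 frame, almost 44% of what the network sees would be padding, which is worth knowing before you settle on a resizer.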

It’s interesting to look at the output for this resizing method. If your original image had an elongated, non-square shape, then you might end up with an image that was quite extensively padded during resizing. Here is what your resulting image might look like if you check it via a tracking tool of your choice:

An example of a padded image that you might end up with when the keep_aspect_ratio_resizer method is used. | Image source: How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way) by Jakub Cieślik

We definitely don’t want to feed such an image to our net. Obviously, it has too much meaningless information coded as black pixels. How can we make it better? We can use a different resizing method.

In the first article, you learned how to approach parameter tuning in an advanced way. Using that approach, you will find other resizing methods available to us within the TensorFlow API. 

The one that might be particularly interesting for us is fixed_shape_resizer, which reshapes the image to a rectangle with a given size defined by height and width parameters. 

Have a look at its implementation within the pipeline.config file:

Fixed_shape_resizer method implementation for EfficientDet D-1

Two things are worth your attention in the above image. 

First, just how easy it is to switch from one method to another: a few lines of changes, nothing complex. 

Second, you now have full control over your input image. Playing around with resizing methods and the size of your input image can help preserve features essential for solving object detection tasks. 

Keep in mind that the smaller your input image, the harder it is for the net to detect objects! It becomes a problem when you want to detect objects that are small compared to your original image size.



Image augmentation

Let’s continue exploring image-related methods with another opportunity for improvement – image augmentation.

Example of variances in image augmentation. | Source: Data Augmentation in Python: Everything You Need to Know by Vladimir Lyashenko

Image augmentation is a way to randomly apply transformations to input images, introducing extra variance in the training dataset. Extra variance, in turn, leads to better model generalization, which is essential for good performance.

The TensorFlow API has a variety of options available for us! Let’s look at the pipeline.config file to see default options for augmentation:

Default image augmentation options for EfficientDet D-1.

As we can see, there are two default options. You must carefully examine your problem domain and decide which augmentation options are relevant to your particular task. 

For example, if you expect that all input images will always be in a particular orientation, random_horizontal_flip will hurt rather than help because it randomly flips input images. Get rid of it since it’s not relevant for your case. Apply similar logic to choosing other augmentation options.

You might be interested in other options available in the TensorFlow API. For your convenience, here’s a link to a script where all methods are listed and well described.

List of options for image augmentation available in TensorFlow API

It’s worth mentioning that for any transformation that affects image geometry (rotation, flipping, scaling, etc.), TensorFlow transforms not only the image itself but also the coordinates of the bounding boxes. There is no need for you to do anything for label transformation.
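As an illustration of what transforming the labels together with the image means, here is what a horizontal flip does to box coordinates. This is a hand-written sketch, not the library's code; it assumes boxes in normalized [y1, x1, y2, x2] form, and the function name flip_boxes_horizontally is hypothetical.

```python
def flip_boxes_horizontally(boxes):
    """Mirror normalized [y1, x1, y2, x2] boxes around the vertical axis."""
    flipped = []
    for y1, x1, y2, x2 in boxes:
        # The old right edge becomes the new left edge and vice versa.
        flipped.append([y1, 1.0 - x2, y2, 1.0 - x1])
    return flipped


# A box hugging the left border ends up hugging the right border.
print(flip_boxes_horizontally([[0.25, 0.0, 0.75, 0.25]]))  # [[0.25, 0.75, 0.75, 1.0]]
```

The vertical coordinates stay untouched, and flipping twice returns the original box.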



Anchor generation

What do the shapes of bounding boxes for objects in your image look like? Are they mostly square-like or rectangular? Is there a particular aspect ratio for the bounding boxes that best catches objects of your interest? 

You should ask yourself these questions to empower your object detection with the ability to find the best possible boxes for your objects.

This becomes especially handy for one-stage object detectors (like EfficientDet) because they rely on a predefined set of anchors for making proposals.

Can we change anchors to best-fit shapes of objects in our custom dataset? Definitely, yes! Here are the lines of code within the pipeline.config file that are responsible for anchor set-up:

Lines within the pipeline.config file that are responsible for the set of the model’s anchors

There is one particular parameter that we’re most interested in, which is aspect_ratios. It defines a ratio for sides of a rectangular anchor. 

Let’s consider aspect_ratios: 2.0 as an example, so you can get a sense of how it works. A value of 2.0 means that the width of an anchor is 2x its height. Such anchor geometry will best fit those objects that are two times horizontally stretched compared to their vertical size.

What if our objects are 10 times horizontally stretched? Let’s set an anchor that will catch such objects: aspect_ratios: 10.0 will do the job.

Conversely, if your objects are stretched in the vertical dimension, set aspect_ratios to be between 0 and 1. The value between 0 and 1 will define by how much the width of your anchor is smaller compared to its height. You can set as many anchors as you want. Just keep adding aspect_ratios as long as you feel it makes sense.
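A quick way to build intuition is to compute anchor dimensions by hand. The sketch below assumes the convention that an aspect ratio is width divided by height and that the anchor area stays fixed as the ratio changes. The function name anchor_size and the scale value of 64 are arbitrary choices for the example.

```python
import math


def anchor_size(scale, aspect_ratio):
    """Return (height, width) of an anchor with area scale**2.

    aspect_ratio is interpreted as width / height.
    """
    ratio_sqrt = math.sqrt(aspect_ratio)
    height = scale / ratio_sqrt
    width = scale * ratio_sqrt
    return height, width


for ratio in (0.5, 1.0, 2.0):
    h, w = anchor_size(scale=64, aspect_ratio=ratio)
    print(f"aspect_ratio={ratio}: height={h:.1f}, width={w:.1f}")
```

Ratios above 1 give wide, flat anchors, while ratios below 1 give tall, narrow ones.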

You can even do prior homework and go through an exploration phase for your machine learning project, analyzing the geometry of your objects. Personally, I like to create two plots where I look at the distribution for height-to-width and width-to-height ratios. This helps me visualize which aspect ratios will work the best for my model’s anchors:

Example for width-to-height ratio distribution that I plot when looking for the best shape for my anchors.
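If you don't want to plot anything, even a simple count gives a useful signal. The following sketch is one possible way to do it, assuming your annotations are available as [y1, x1, y2, x2] boxes. The function names and the candidate ratios are hypothetical.

```python
from collections import Counter


def width_to_height_ratios(boxes):
    """Compute width / height for boxes given as [y1, x1, y2, x2]."""
    return [(x2 - x1) / (y2 - y1) for y1, x1, y2, x2 in boxes if y2 > y1]


def ratio_histogram(ratios, candidates=(0.5, 1.0, 2.0)):
    """Count how many boxes are closest to each candidate aspect ratio."""
    nearest = (min(candidates, key=lambda c: abs(c - r)) for r in ratios)
    return Counter(nearest)


# Made-up boxes; in practice these would come from your annotations.
boxes = [[0, 0, 10, 20], [0, 0, 10, 22], [0, 0, 20, 10], [0, 0, 10, 10]]
print(ratio_histogram(width_to_height_ratios(boxes)))
```

Candidates that attract almost no boxes are good candidates for removal, and a heavy cluster far from every candidate suggests adding a new aspect ratio.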

Post-processing and overfitting prevention

Similarly to pre-processing, the post-processing step can also affect your model’s behavior. Object detectors tend to generate hundreds of proposals. Most of them won’t be accepted and will be eliminated by some criteria. 

TensorFlow allows you to define a set of criteria to control model proposals. Let’s have a look at this code snippet within pipeline.config file:

Piece of code (default values are kept) that defines post-processing parameters for EfficientDet D-1

There is a method called non-maximum suppression (NMS) that’s used for post-processing within EfficientDet D-1. The method itself has proven to be successful for the vast majority of computer vision tasks, so I won’t explore any alternatives.

Non maximum suppression in action. | Source: Non-maximum Suppression (NMS) by Sambasivarao. K
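The idea behind NMS fits in a few lines. The sketch below is a plain greedy version written for illustration, not the batched implementation the API uses. The arguments score_threshold and iou_threshold are named after the config parameters, and boxes are assumed to be in [y1, x1, y2, x2] form.

```python
def box_iou(a, b):
    """IoU of two [y1, x1, y2, x2] boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.2, iou_threshold=0.5):
    """Return indices of the boxes that survive greedy NMS."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Made-up proposals: two overlapping boxes, one separate, one low-score.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]]
scores = [0.9, 0.8, 0.7, 0.1]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box is suppressed because it overlaps the higher-scoring first box, and the last one never gets considered because its score is below the threshold.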

What matters here is the set of parameters that goes with the batch_non_max_suppression method. These parameters are important and might play a big role in your model’s final performance. Let’s see how they can do that:

  • score_threshold is a parameter that defines the minimum classification confidence score a proposal must reach to avoid being filtered out. In the default configuration, it’s set to a value close to 0, meaning that practically all proposals are accepted. Does it sound like a reasonable value? My personal practice has shown that some minimum filtering tends to give better results in the end. Eliminating those proposals that are most likely to be incorrect leads to more stable training, better convergence, and lower chances of overfitting. Consider setting this parameter to at least 0.2. It’s especially important when your tracking tool shows that your net makes poor proposals on the evaluation set and/or your evaluation metrics are not improving over time;
  • iou_threshold is a parameter that lets NMS make proper filtering for overlapping boxes. If your model generates overlapping boxes for objects, consider lowering this score. In case you have a dense distribution of objects on your image, think about increasing this parameter;
  • max_detections_per_class is pretty straightforward in its name. How many objects do you expect for every single class? A couple, a dozen, or hundreds? This parameter will help your net get a sense of that. My suggestion here is to set this value equal to the maximum number of objects for a single class times the number of anchors (number of aspect_ratios) you have;
  • max_total_detections should be set to max_detections_per_class * total number of classes. It’s also important to set max_number_of_boxes to the same number as max_total_detections. max_number_of_boxes is located within the train_config part of your pipeline.config file.

Given the above approach to setting parameters, you’ll let your model have an idea of how many objects to expect and what their density is. It will lead to better end performance and will also lower chances of overfitting.
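The rule of thumb for the detection limits is simple arithmetic, sketched below. The function name detection_limits and the example numbers are made up. Only the three parameter names in the returned dict come from the configuration discussed above.

```python
def detection_limits(max_objects_per_class, num_aspect_ratios, num_classes):
    """Apply the rule of thumb described above for the NMS limits."""
    max_detections_per_class = max_objects_per_class * num_aspect_ratios
    max_total_detections = max_detections_per_class * num_classes
    return {
        "max_detections_per_class": max_detections_per_class,
        "max_total_detections": max_total_detections,
        # The text suggests keeping max_number_of_boxes in sync.
        "max_number_of_boxes": max_total_detections,
    }


# Example: up to 10 objects per class, 3 aspect ratios, 4 classes.
print(detection_limits(10, 3, 4))
```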

Since we touched on the overfitting problem, I’ll also share another common tool for eliminating it – the dropout layer, which you implement like this:

Dropout with probability = 0.2 set for box_predictor net within EfficientDet D-1

Dropout implementation forces your model to look for those features that best describe objects you want to detect. It helps improve generalization capabilities. Better generalization helps your model be more resistant to overfitting.
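For intuition, this is roughly what a dropout layer does to activations during training. The sketch uses the common inverted-dropout formulation, where surviving values are rescaled so that the expected sum stays the same. It is a pure-Python illustration with a hypothetical function name, not the layer used inside the box predictor.

```python
import random


def dropout(values, drop_probability=0.2, rng=None):
    """Inverted dropout: zero some values, rescale the rest."""
    rng = rng or random.Random()
    keep_probability = 1.0 - drop_probability
    return [v / keep_probability if rng.random() < keep_probability else 0.0
            for v in values]


# A fixed seed makes the example repeatable.
activations = [1.0] * 10
print(dropout(activations, drop_probability=0.2, rng=random.Random(0)))
```

Because a different random subset is zeroed at every step, the net can't rely on any single activation and has to spread the useful signal across many features.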

Illustration for a dropout layer (with probability = 0.5) implemented within a simple neural net.

Last but not least, you can avoid overfitting and get better model performance via advanced approaches to learning rate control over time. Specifically, we’re interested in ways to push our training job to find a real global minimum for a given loss function. 

Learning rate scheduling is crucial for this goal. Let’s have a look at what TensorFlow offers for learning rate scheduling in a default configuration for EfficientDet D-1:

Learning rate scheduler implementation in a default configuration for EfficientDet D-1

Cosine learning rate decay is a great scheduler that allows your learning rate both to grow and to decrease throughout the training time.

Learning rate changes over time when Cosine Learning Rate Decay is implemented. | Source: Cosine Learning rate decay by Sebastian Correa

Why can this approach to scheduling give you better model performance and overfitting prevention? For several reasons:

  • Starting with a low learning rate gives you control over gradients at the very beginning of training your model. We don’t want them to become extremely big, so the original model’s weights are not changed drastically. Keep in mind that we fine-tune our model on a custom dataset, and there’s no need to change low-level features that the neural net has already learned. They will most likely stay the same for our model;
  • An initial increase in learning rate will help your model have enough capacity not to get stuck in a local minimum and be able to get out of it;
  • Smooth learning rate decay over time will lead to stable training, and will also let your model find the best possible fit for your data.

Are you convinced now that learning rate scheduling is important? If yes, here’s how to configure it properly:

  • learning_rate_base is the peak learning rate, reached at the end of the warmup phase; from there the cosine decay begins;
  • total_steps defines the number of total steps your model is going to train. Keep in mind that at the last steps of your training job, the learning rate scheduler will drive the learning rate value close to zero;
  • warmup_learning_rate is the initial learning rate your model starts training with; it should be lower than learning_rate_base;
  • warmup_steps defines the number of steps taken to increase the learning rate from warmup_learning_rate to learning_rate_base.
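To see how the four parameters interact, here is a minimal sketch of a warmup-plus-cosine schedule: the rate ramps linearly from warmup_learning_rate to learning_rate_base over warmup_steps, then decays along a cosine curve to zero at total_steps. It is written from the description of the schedule, not copied from the library, and the numbers in the example are illustrative only.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base, total_steps,
                             warmup_learning_rate, warmup_steps):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step > total_steps:
        return 0.0
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Cosine decay from learning_rate_base down to zero at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


# Illustrative values only; take the real ones from your pipeline.config.
schedule = dict(learning_rate_base=0.08, total_steps=1000,
                warmup_learning_rate=0.001, warmup_steps=100)
for step in (0, 50, 100, 550, 1000):
    print(step, round(cosine_decay_with_warmup(step, **schedule), 5))
```

One practical consequence: total_steps here should match the number of steps you actually train for. Otherwise training stops while the learning rate is still high, or keeps running with a rate of zero.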

Loss function manipulation

You might have been in situations when your model does a brilliant job localizing objects but performs quite poorly in classification. In contrast, it could be the case that classification is extremely good, but object localization could be better.

This becomes especially important when an object detector is part of a pipeline of services where each service is a machine learning model. In this case, each model’s output should be good enough for the following model to digest it as an input.

Think about it this way: you’re trying to detect all pieces of text on an image in order to pass each piece of text to the next OCR model. What if your model detects all pieces of text but sometimes cuts text off due to poor localization?

That’s going to be a problem for the following OCR because it won’t get the entire text to read. OCR will be able to process a cut piece of text, but its output will be meaningless for us. How can we work with that?

TensorFlow gives you an option to prioritize what matters to you via weights within the loss function. Have a look at this code snippet:

Initial set up for weights within loss function. Equal values for classification and localization.

You can change values for these parameters to give a higher weight to what matters to you the most. Alternatively, you can lower the value for a particular part of the total loss. Both approaches ultimately do the same job.

If you decide to play around with weight values, my personal recommendation would be to change the weights in increments of roughly 0.1 to 0.3. Bigger changes might lead to a significant imbalance.
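The effect of the weights is easy to see on paper. The sketch below treats the total loss as a weighted sum of the two components and ignores any regularization term the real model may add. The function name total_loss and the loss values are made up for the example.

```python
def total_loss(classification_loss, localization_loss,
               classification_weight=1.0, localization_weight=1.0):
    """Weighted sum of the two loss components."""
    return (classification_weight * classification_loss
            + localization_weight * localization_loss)


# Same raw losses, but localization errors now cost 30% more.
print(total_loss(0.40, 0.25))
print(total_loss(0.40, 0.25, localization_weight=1.3))
```

With the higher localization weight, the same box error contributes more to the total, so the optimizer is pushed harder toward tighter boxes.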


Your proficiency with the TensorFlow API has reached a new level. You’re now in full control of your experiments and know how to evaluate and compare them, so only the best will go to production!

You’re also familiar with how to move your model to production. You know how to export a model and have all code necessary to perform inference. 

Hopefully, you now have a feeling of what your opportunities are for further model improvement. Give it a shot. You’ll love it when you see your metrics grow. Be creative with your hypothesis setting, and don’t be afraid to try new ideas. Maybe your next configuration will set a benchmark for all of us! Who knows?

See you next time!
