This article is the second part of a series where you learn an end-to-end workflow for TensorFlow Object Detection and its API. In the first article, you learned how to create a custom object detector from scratch, but there are still plenty of things that need your attention to become truly proficient.
We’ll explore topics that are just as important as the model creation process we’ve already gone through. Here are some of the questions we’ll answer:
- How to evaluate my model and get an estimate of its performance?
- What are the tools that I can use to track model performance and compare results across multiple experiments?
- How can I export my model to use it in the inference mode?
- Is there a way to boost model performance even more?
It’s interesting to know how our model is going to perform in the wild. To get an idea of how well our model does on real data, we need to do a few things:
- Select a set of evaluation metrics,
- Get separate datasets to validate and test your model,
- Launch the evaluation process with an appropriate set of parameters.
Evaluation, Step 1: Metrics
Let’s start with a set of evaluation metrics. You might remember that in the first article, we installed a dependency called COCO API.
We needed this to access a set of useful metrics for object detection: mean average precision and recall. In case you don’t remember these metrics, you should definitely read about them. Computer vision engineers use them quite a lot.
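To use mean average precision and recall, you should configure your pipeline.config file. The metrics_set parameter in the eval_config block should be set to “coco_detection_metrics”.
This is the default option for this parameter, so you most likely already have it there. Check that your eval_config block contains the line metrics_set: “coco_detection_metrics”.
With the “coco_detection_metrics” set in metrics_set, the COCO family of metrics becomes available: mean average precision (averaged over IoU thresholds from 0.50 to 0.95, at IoU 0.50, at IoU 0.75, and separately for small, medium, and large objects) and average recall (for at most 1, 10, and 100 detections per image, and separately for small, medium, and large objects).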
Your choice for the set of metrics is not limited to precision and recall. There are some other options available in the TensorFlow API. Take your time to choose the option you want for tracking your particular model.
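As a quick refresher on the metrics mentioned above, here is a minimal sketch of precision, recall, and average precision for a single class. It assumes each detection has already been marked as a true or false positive at some IoU threshold. The function names are my own, and the all-point interpolation used here is simpler than the exact procedure the COCO API follows, so treat the numbers as illustrative only.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts at a fixed IoU threshold."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    `is_true_positive` lists detections sorted by descending confidence;
    each entry says whether that detection matched a ground-truth box.
    """
    precisions, recalls = [], []
    hits = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        hits += 1 if hit else 0
        precisions.append(hits / rank)
        recalls.append(hits / num_ground_truth)
    # Interpolation: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve wherever recall increases.
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    print(precision_recall(true_positives=8, false_positives=2, false_negatives=4))
    print(average_precision([True, False, True], num_ground_truth=2))
```

Mean average precision is then the mean of these per-class values (and, for the COCO variant, also the mean over several IoU thresholds).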
Evaluation, Step 2: Datasets
If you carefully followed the instructions in the first article, then dataset preparation should sound familiar.
As a reminder, we prepared two files (validation.record and test.record) needed for model evaluation and placed them into Tensorflow/workspace/data. If your data folder has the same number of files as below, then you’re all set to proceed to the next step!
Tensorflow/
└─ cocoapi/
└─ ...
└─ workspace/
   └─ data/
      ├─ train.record       # dataset that was used to train our model
      ├─ validation.record  # dataset that we’ll use for model validation
      ├─ test.record        # dataset that we’ll use to test our model
In case you’re missing one of the .record files in your data folder but still want to do the evaluation, here is something to consider:
- validation.record is needed to evaluate your model during training;
- test.record is needed to check end-model performance when it’s already been trained.
The traditional approach to machine learning needs 3 separate sets: for training, validation, and testing. I urge you to follow it, but if you have a valid reason to skip one of these sets, prepare only those .record files that are relevant for your purposes.
Evaluation, Step 3: Process Launch
As mentioned before, you can do the model evaluation at two different points in time: during training or after the model’s been trained.
Model evaluation during training is called validation. The TensorFlow Object Detection API’s validation job is treated as an independent process that should be launched in parallel with the training job.
When launched in parallel, the validation job will wait for checkpoints that the training job generates during model training and use them one by one to validate the model on a separate dataset.
validation.record represents a separate dataset that the model uses for validation. The metrics_set parameter within the eval_config block defines a set of metrics for evaluation.
In order to start the validation job, open a new Terminal window, navigate to Tensorflow/workspace/, and launch the following command:
python model_main_tf2.py \
  --pipeline_config_path=<path to your config file> \
  --model_dir=<path to a directory with your model> \
  --checkpoint_dir=<path to a directory with checkpoints> \
  --num_workers=<int for the number of workers to use> \
  --sample_1_of_n_eval_examples=1
- <path to your config file> is a path to the config file which was used for training the model you want to evaluate. Should be a config file from ./models/<folder with the model of your choice>/v1/,
- <path to a directory with your model> is a path to a directory where the evaluation job will write logs (evaluation results). My recommendation is to use the following path: ./models/<folder with the model of your choice>/v1/ . Given this, your evaluation results will be placed next to training logs,
- <path to a directory with checkpoints> is a directory where your training job writes checkpoints. Should also be the following: ./models/<folder with the model of your choice>/v1/,
- <int for the number of workers to use> if you have a multi-core CPU, this parameter defines the number of cores that can be used for the evaluation job. Keep in mind that your training job has already occupied the number of cores that you allocated for it. Given that, set the number of cores for evaluation appropriately.
Right after executing the above command, your evaluation job will begin. Similarly to what we did for the training job, if you want to do the evaluation on a GPU, enable it by executing the following command before launching the evaluation job:
export CUDA_VISIBLE_DEVICES=<GPU number>
Where <GPU number> defines the index of the GPU you want to use. Note that the count starts from zero. For validation on a CPU, use -1, that is, export CUDA_VISIBLE_DEVICES=-1.
Model performance tracking
Intro to model performance tracking
In machine learning, it’s hard to tell in advance which model will give you the best results for a given task. Developers often test multiple hypotheses using a trial and error approach.
You can examine different model architectures or stick to one architecture but play around with different parameter setups. Each and every configuration should be tested via a separate training job launch, so a tool for tracking and comparing multiple experiments comes in handy.
The TensorFlow API writes model performance-related logs and optimizer state using the tfevents format. There are two main tfevents you want to keep track of: training-related and evaluation-related.
The training tfevent is limited to loss and learning rate tracking. It also keeps track of the number of steps per second, so you can see how fast your training job is going.
Such experiment metadata is automatically logged when you launch a training job for your model. Logs are stored at Tensorflow/workspace/models/<folder with the model of your choice>/v1/train and can be visualized using TensorBoard (which we’ll talk about in a second).
Note that you’re able to see the loss on a component level breakdown (separately for classification and localization), and also as a total value. It becomes especially useful when you face a problem and want to examine your model to find the root cause of it.
As said previously, we can also keep track of how the learning rate changes over time and how many steps per second your training job does.
The evaluation tfevent, similar to training tfevent, also consists of a loss component with the same breakdown. Besides that, it keeps track of evaluation metrics that we talked about before.
There are multiple tools that can help you track and compare model-related logs. The one that is already built into the TensorFlow API is TensorBoard.
TensorBoard is relatively easy to use. To launch it, open a Terminal window and navigate to the Tensorflow/workspace/models/<folder with the model of your choice>/ directory.
When there, use this command to launch TensorBoard:
tensorboard --logdir=<path to a directory with your experiment / experiments>
You can pass to --logdir a path to a folder with logs for multiple experiments in it (e.g.: Tensorflow/workspace/models/).
You can also limit accessible data by providing a path to logs for a particular experiment (e.g.: Tensorflow/workspace/models/<folder with the model of your choice>/).
In any case, TensorBoard will automatically find all directories with logs and use this data to build plots. You can learn more about what TensorBoard can do in the official guide.
Neptune.ai is an alternative tracking tool for your consideration. Compared to TensorBoard, it provides a broader range of features. Here is what I found especially convenient:
- Neptune is fully compatible with the tfevent (TensorBoard) format. All you need to do is run a single command in your Terminal window,
- You can import only those experiments that you consider important, filtering out the launches you want to exclude from the comparison. This way, your final dashboard stays neat and isn’t overloaded with too many experiments,
- Your work (notebooks, experiment results) is easy to share with others (just send a link),
- You can track anything you want. It becomes especially handy when you also want to keep track of model parameters and/or its artifacts. Your hardware utilization is also visible, and all of that is available to you in a single place.
All right, your model has now been trained, you are satisfied with its performance, and now want to use it for inference. Let me show you how you can do that. It’s going to be a two-step process:
1. First step – model export. To do that, you should:
- Copy the exporting script from Tensorflow/models/research/object_detection/exporter_main_v2.py and paste it into Tensorflow/workspace,
- Within Tensorflow/workspace, create a new folder called exported_models. This is going to be a place where you will place all of your exported models,
- Within Tensorflow/workspace/exported_models create a subfolder where a particular exported model will be stored. Give this folder the same name as <folder with the model of your choice> that you used within Tensorflow/workspace/models/<folder with the model of your choice>,
Open a new Terminal window, make Tensorflow/workspace your current working directory, and launch the following command:
python exporter_main_v2.py \
  --pipeline_config_path=<path to a config file> \
  --trained_checkpoint_dir=<path to a directory with your trained model> \
  --output_directory=<path to a directory where to export a model> \
  --input_type=image_tensor
- <path to a config file> is a path to the config file for the model you want to export. Should be a config file from ./models/<folder with the model of your choice>/v1/
- <path to a directory with your trained model> is a path to a directory where model checkpoints were placed during training. Should also be the following: ./models/<folder with the model of your choice>/v1/
- <path to a directory where to export a model> is a path where an exported model will be saved. Should be: ./exported_models/<folder with the model of your choice>
2. Second step – running your model in inference mode.
For your convenience, I made a Jupyter notebook that has all the code you need to do inference. Your goal is to go over it and fill in all missing values marked as TODOs.
In the Jupyter notebook, you’ll find two functions for inference that you can use depending on your goal.
Use inference_with_plot when you just want to visualize your model output as bounding boxes plotted over the objects in your input image. The function output, in this case, is a plot.
Alternatively, you can use inference_as_raw_output which, instead of plotting, returns a dict that has 3 keys:
- detection_classes is an array with all classes that were detected. Classes are returned as integers,
- detection_scores is an array with the detection confidence score for each detected object,
- detection_boxes is an array with coordinates for bounding boxes for each detected object. Each box has the following format: [y1, x1, y2, x2]. The top left corner is defined by y1 and x1, whereas the bottom right is defined by y2 and x2.
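To make the raw output easier to work with, here is a minimal sketch that keeps only confident detections and converts boxes to pixel coordinates. The helper name to_pixel_detections and the min_score parameter are hypothetical. The sketch also assumes the boxes are normalized to the 0..1 range, which is what exported Object Detection API models typically return; check the notebook output before relying on that.

```python
def to_pixel_detections(raw_output, image_height, image_width, min_score=0.5):
    """Filter a raw output dict by score and convert boxes to pixel corners.

    Assumes `raw_output` has the keys detection_classes, detection_scores and
    detection_boxes, with boxes given as normalized [y1, x1, y2, x2].
    """
    results = []
    for class_id, score, box in zip(raw_output["detection_classes"],
                                    raw_output["detection_scores"],
                                    raw_output["detection_boxes"]):
        if score < min_score:
            continue  # skip low-confidence detections
        y1, x1, y2, x2 = box
        results.append({
            "class_id": int(class_id),
            "score": float(score),
            # (x, y) order, which is what most drawing libraries expect
            "top_left": (round(x1 * image_width), round(y1 * image_height)),
            "bottom_right": (round(x2 * image_width), round(y2 * image_height)),
        })
    return results


if __name__ == "__main__":
    raw = {
        "detection_classes": [1, 2],
        "detection_scores": [0.9, 0.3],
        "detection_boxes": [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]],
    }
    print(to_pixel_detections(raw, image_height=400, image_width=800))
```

Plain lists are used here to keep the example dependency-free, but the same loop works on NumPy arrays.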
Opportunities for model improvement
In this part, I want to share with you some cool methods that can boost your model performance. My goal here is to give you a high-level overview of what’s available in the TensorFlow API and its arsenal. I will also give you an intuition for implementing these methods. Let’s get started!
You should know what you feed into your model. Image preprocessing is a crucial step in any computer vision application.
TensorFlow does the image normalization step (or standardization step if you prefer, the differences between normalization and standardization are nicely described here) under the hood, and we can’t influence that. But we can control how the image is resized and which size it’s resized to.
To get a better sense of how the TensorFlow API does this, let’s have a look at the image_resizer settings in pipeline.config for the EfficientDet D-1 model.
The default method responsible for image resizing for EfficientDet D-1 is keep_aspect_ratio_resizer. This method, as defined by the min_dimension and max_dimension parameters (both set to 640 by default), resizes the smaller side of an image to min_dimension pixels unless that would push the larger side beyond max_dimension, in which case the larger side is resized to max_dimension instead. The other side is resized to preserve the original aspect ratio.
pad_to_max_dimension set to true enables padding, which fills the resized image up to a max_dimension x max_dimension square while the image content keeps its original aspect ratio.
It’s interesting to look at the output for this resizing method. If your original image was far from square, you might end up with an image that was quite extensively padded during resizing. You can check what the resulting image looks like via a tracking tool of your choice.
An example of a padded image that you might end up with when the keep_aspect_ratio_resizer method is used. | Image source: How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way) by Jakub Cieślik
We definitely don’t want to feed such an image to our net. Obviously, it has too much meaningless information coded as black pixels. How can we make it better? We can use a different resizing method.
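To estimate how much of your input would be padding, here is a minimal sketch that mimics my understanding of how keep_aspect_ratio_resizer picks the output size. The helper names are hypothetical, and rounding details may differ from the actual TensorFlow implementation, so use it for a rough estimate only.

```python
def keep_aspect_ratio_size(height, width, min_dimension, max_dimension):
    """Return (new_height, new_width) after aspect-preserving resizing."""
    # First try to bring the smaller side to min_dimension.
    scale = min_dimension / min(height, width)
    # If the larger side would overshoot, fit it to max_dimension instead.
    if round(max(height, width) * scale) > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)


def padded_fraction(height, width, min_dimension=640, max_dimension=640):
    """Share of the max_dimension x max_dimension canvas filled with padding."""
    new_height, new_width = keep_aspect_ratio_size(
        height, width, min_dimension, max_dimension)
    return 1.0 - (new_height * new_width) / (max_dimension * max_dimension)


if __name__ == "__main__":
    # A Full HD frame ends up as 360 x 640 content on a 640 x 640 canvas.
    print(keep_aspect_ratio_size(1080, 1920, 640, 640))
    print(padded_fraction(1080, 1920))
```

For a 1080 x 1920 image, a bit under half of the canvas turns out to be padding, which is the kind of waste described above.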
In the first article, you learned how to approach parameter tuning in an advanced way. Using that approach, you will find other resizing methods available to us within the TensorFlow API.
The one that might be particularly interesting for us is fixed_shape_resizer, which reshapes the image to a rectangle with a given size defined by the height and width parameters.
In the pipeline.config file, it takes the place of keep_aspect_ratio_resizer inside the image_resizer block.
Two things are worth your attention here.
First, just how easy it is to switch from one method to another: a few lines of changes, nothing complex.
Second, you now have full control over your input image. Playing around with resizing methods and the size of your input image can help preserve features essential for solving object detection tasks.
Keep in mind that the smaller your input image, the harder it is for the net to detect objects! It becomes a problem when you want to detect objects that are small compared to your original image size.
Let’s continue exploring image-related methods with another opportunity for improvement – image augmentation.
Image augmentation is a way to randomly apply transformations to input images, introducing extra variance in the training dataset. Extra variance, in turn, leads to better model generalization, which is essential for good performance.
The TensorFlow API has a variety of options available for us! They are listed as data_augmentation_options in the train_config part of the pipeline.config file.
There are two default options: in the default EfficientDet D-1 configuration these are random_horizontal_flip and random_scale_crop_and_pad_to_square. You must carefully examine your problem domain and decide which augmentation options are relevant to your particular task.
For example, if you expect that all input images will always be in a particular orientation, random_horizontal_flip will hurt rather than help because it randomly flips input images. Get rid of it since it’s not relevant for your case. Apply similar logic to choosing other augmentation options.
You might be interested in other options available in the TensorFlow API. For your convenience, here’s a link to a script where all methods are listed and well described.
It’s worth mentioning that in case of any transformation which will affect image orientation (rotation, flipping, scaling, etc), TensorFlow doesn’t only transform the image itself but also transforms coordinates for bounding boxes. There is no need for you to do anything for label transformation.
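To get an intuition for what happens to labels during such a transformation, here is a minimal sketch of a horizontal flip applied to a single bounding box. It assumes normalized [y1, x1, y2, x2] coordinates, and the function name is my own; it illustrates the idea rather than reproducing TensorFlow’s implementation.

```python
def flip_box_horizontally(box):
    """Mirror a normalized [y1, x1, y2, x2] box around the vertical image axis."""
    y1, x1, y2, x2 = box
    # The old right edge becomes the new left edge, and vice versa.
    return [y1, 1.0 - x2, y2, 1.0 - x1]


if __name__ == "__main__":
    original = [0.1, 0.2, 0.5, 0.6]
    flipped = flip_box_horizontally(original)
    print(flipped)
    # Flipping twice brings the box back to where it started.
    print(flip_box_horizontally(flipped))
```

The vertical coordinates stay untouched, while the horizontal ones are mirrored and swapped so that x1 remains the left edge.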
What do the shapes of bounding boxes for objects in your image look like? Are they mostly square-like or rectangular? Is there a particular aspect ratio for the bounding boxes that best catches objects of your interest?
You should ask yourself these questions to empower your object detection with the ability to find the best possible boxes for your objects.
This becomes especially handy for one-stage object detectors (like EfficientDet) because a predefined set of anchors is used for making proposals.
Can we change anchors to best fit shapes of objects in our custom dataset? Definitely, yes! Here are the lines of code within the pipeline.config file that are responsible for anchor set-up:
There is one particular parameter that we’re most interested in, which is aspect_ratios. It defines the ratio between the sides of a rectangular anchor.
Let’s take aspect_ratios: 2.0 as an example, so you can get a sense of how it works. A value of 2.0 means that the width of an anchor = 2x its height. Such anchor geometry will best fit those objects that are two times horizontally stretched compared to their vertical size.
What if our objects are 10 times horizontally stretched? Let’s set an anchor that will catch such objects: aspect_ratios: 10.0 will do the job.
Conversely, if your objects are stretched in the vertical dimension, set aspect_ratios to be between 0 and 1. A value between 0 and 1 defines how much smaller the width of your anchor is compared to its height. You can set as many anchors as you want. Just keep adding aspect_ratios as long as you feel it makes sense.
You can even do prior homework and go through an exploration phase for your machine learning project, analyzing the geometry of your objects. Personally, I like to create two plots where I look at the distribution for height-to-width and width-to-height ratios. This helps me visualize which aspect ratios will work the best for my model’s anchors:
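Here is a minimal sketch of that kind of analysis. It computes width-to-height ratios for your ground-truth boxes and shows the anchor size that a given aspect ratio produces. The function names are hypothetical, and the anchor formula reflects my understanding of how the API’s anchor generators use aspect_ratios (width divided by height, with the area kept constant).

```python
import math


def box_aspect_ratios(boxes):
    """Width-to-height ratio for each [y1, x1, y2, x2] box in pixels."""
    ratios = []
    for y1, x1, y2, x2 in boxes:
        height, width = y2 - y1, x2 - x1
        if height > 0 and width > 0:  # ignore degenerate boxes
            ratios.append(width / height)
    return ratios


def anchor_size(scale, aspect_ratio):
    """(height, width) of an anchor with area scale**2 and the given ratio."""
    root = math.sqrt(aspect_ratio)
    return scale / root, scale * root


if __name__ == "__main__":
    boxes = [[0, 0, 10, 20], [0, 0, 30, 10], [5, 5, 5, 9]]
    print(box_aspect_ratios(boxes))          # one wide box, one tall box
    print(anchor_size(64, aspect_ratio=4.0))  # a wide anchor
```

Use pixel coordinates for this analysis: with normalized boxes, the ratio is distorted by the aspect ratio of the image itself.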
Post-processing and overfitting prevention
Similarly to pre-processing, the post-processing step can also affect your model’s behavior. Object detectors tend to generate hundreds of proposals. Most of them won’t be accepted and will be eliminated by some criteria.
TensorFlow allows you to define a set of criteria to control model proposals. Let’s have a look at this code snippet within pipeline.config file:
There is a method called non-maximum suppression (NMS) that’s used for processing within EfficientDet D-1. The method itself has proven to be successful for the vast majority of computer vision tasks, so I won’t explore any alternatives.
What matters here is a set of parameters that goes with the batch_non_max_suppression method. These parameters are important and might play a big role in your model’s end performance. Let’s see how they can do that:
- score_threshold is a parameter that defines a minimum confidence score for classification, which should be reached so the proposal won’t be filtered out. In a default configuration, it’s set to a value close to 0, meaning that all proposals are accepted. Does it sound like a reasonable value? My personal practice has shown that at least minimal filtering tends to give better results in the end. Eliminating those proposals that are most likely to be incorrect leads to more stable training, better convergence, and lower chances of overfitting. Consider setting this parameter to at least 0.2. It’s especially important to do when your tracking tool shows that your net makes poor proposals on the evaluation set and/or your evaluation metrics are not improving over time;
- iou_threshold is a parameter that lets NMS properly filter overlapping boxes. If your model generates overlapping boxes for objects, consider lowering this value. In case you have a dense distribution of objects on your image, think about increasing this parameter;
- max_detections_per_class is pretty straightforward in its name. How many objects do you expect for every single class? A couple, a dozen, or hundreds? This parameter will help your net get a sense of that. My suggestion here is to set this value equal to the maximum number of objects for a single class times the number of anchors (number of aspect_ratios) you have;
- max_total_detections should be set to max_detections_per_class * total number of classes. It’s also important to set max_number_of_boxes to the same number as max_total_detections. max_number_of_boxes is located within the train_config part of your pipeline.config file.
Given the above approach to setting parameters, you’ll let your model have an idea of how many objects to expect and what their density is. It will lead to better end performance and will also lower chances of overfitting.
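To see how these parameters interact, here is a minimal, dependency-free sketch of greedy non-maximum suppression for a single class. It is a simplified illustration, not the batch_non_max_suppression implementation from TensorFlow. The function names are my own, and max_detections stands in for the per-class limit.

```python
def iou(box_a, box_b):
    """Intersection over union of two [y1, x1, y2, x2] boxes."""
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.2,
                        iou_threshold=0.5, max_detections=100):
    """Return indices of the boxes kept for one class, best score first."""
    # Drop low-confidence proposals, then visit the rest from best to worst.
    candidates = sorted(
        (i for i, score in enumerate(scores) if score >= score_threshold),
        key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if len(kept) == max_detections:
            break
        # Keep a box only if it doesn't overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 10, 10]]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(non_max_suppression(boxes, scores))                     # [0, 2]
    print(non_max_suppression(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]
```

In the example, the second box overlaps the first one with an IoU of about 0.68, so it survives only when iou_threshold is raised above that value. The last box never makes it past score_threshold.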
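Since we touched on the overfitting problem, I’ll also share another common tool for eliminating it: the dropout layer, which you enable in the box predictor settings of your pipeline.config file via the use_dropout and dropout_keep_probability parameters.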
Dropout implementation forces your model to look for those features that best describe objects you want to detect. It helps improve generalization capabilities. Better generalization helps your model be more resistant to overfitting.
Last but not least, you can avoid overfitting and get better model performance via advanced approaches to learning rate control over time. Specifically, we’re interested in ways to push our training job to find a real global minimum for a given loss function.
Learning rate scheduling is crucial for this goal. Let’s have a look at what TensorFlow offers for learning rate scheduling in a default configuration for EfficientDet D-1:
Cosine learning rate decay is a great scheduler that allows your learning rate both to grow and to decrease throughout the training time.
Why can this approach to scheduling give you better model performance and overfitting prevention? For several reasons:
- Starting with a low learning rate gives you control over gradients at the very beginning of training your model. We don’t want them to become extremely big, so the original model’s weights are not changed drastically. Keep in mind that we fine-tune our model on a custom dataset, and there’s no need to change low-level features that the neural net has already learned. They will most likely stay the same for our model;
- An initial increase in learning rate will help your model have enough capacity not to get stuck in a local minimum and be able to get out of it;
- Smooth learning rate decay over time will lead to stable training, and will also let your model find the best possible fit for your data.
Are you convinced now that learning rate scheduling is important? If yes, here’s how to configure it properly:
- learning_rate_base is the peak learning rate, reached at the end of the warmup phase, from which the cosine decay starts;
- total_steps defines the number of total steps your model is going to train. Keep in mind that at the last steps of your training job, the learning rate scheduler will drive the learning rate value close to zero;
- warmup_learning_rate is the learning rate your model starts training with, at the beginning of the warmup phase;
- warmup_steps defines the number of steps taken to increase the learning rate from warmup_learning_rate to learning_rate_base.
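To see the shape of this schedule, here is a minimal sketch of cosine decay with a linear warmup in plain Python. It follows my understanding of the Object Detection API’s schedule, in which warmup_learning_rate is the starting value and learning_rate_base is the peak. The numbers in the example are illustrative, so replace them with the ones from your own pipeline.config.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base, total_steps,
                             warmup_learning_rate, warmup_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step >= total_steps:
        return 0.0
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Cosine curve from learning_rate_base (end of warmup) down to zero.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    settings = dict(learning_rate_base=0.08, total_steps=300000,
                    warmup_learning_rate=0.001, warmup_steps=2500)
    for step in (0, 1250, 2500, 151250, 300000):
        print(step, cosine_decay_with_warmup(step, **settings))
```

Printing a handful of steps like this is a cheap way to double-check that total_steps matches the number of steps you actually train for; otherwise the learning rate hits zero too early or never gets close to it.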
Loss function manipulation
You might have been in situations when your model does a brilliant job localizing objects but performs quite poorly in classification. In contrast, it could be the case that classification is extremely good, but object localization could be better.
It becomes especially important when an object detector is included in a pipeline of services where each service is a machine learning model. In this case, each model’s output should be good enough for the following model to digest it as an input.
Think about it this way: you’re trying to detect all pieces of text on an image in order to pass each piece of text to the next OCR model. What if your model detects all pieces of texts but sometimes cuts text off due to poor localization?
That’s going to be a problem for the following OCR because it won’t get the entire text to read. OCR will be able to process a cut piece of text, but its output will be meaningless for us. How can we work with that?
TensorFlow gives you an option to prioritize what matters to you via weights within the loss function: the classification_weight and localization_weight parameters in the loss block of your pipeline.config file.
You can change values for these parameters to give a higher weight to what matters to you the most. Alternatively, you can lower the value for a particular part within the total loss. Both approaches eventually do the same job.
If you decide to play around with weight values, my personal recommendation would be to change weights in increments of 0.1 to 0.3. Bigger values might lead to a significant imbalance.
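To get a feeling for what these weights do, here is a minimal sketch of a weighted total loss and of the share that localization contributes to it. It is a simplification with hypothetical helper names: the real total loss in the API may also include a regularization term, which is left out here.

```python
def total_loss(classification_loss, localization_loss,
               classification_weight=1.0, localization_weight=1.0):
    """Weighted sum of the two loss components."""
    return (classification_weight * classification_loss
            + localization_weight * localization_loss)


def localization_share(classification_loss, localization_loss,
                       classification_weight=1.0, localization_weight=1.0):
    """Fraction of the weighted total loss that comes from localization."""
    total = total_loss(classification_loss, localization_loss,
                       classification_weight, localization_weight)
    return localization_weight * localization_loss / total if total else 0.0


if __name__ == "__main__":
    # Equal weights: localization makes up one third of the total loss.
    print(localization_share(0.4, 0.2))
    # Raising localization_weight by 0.3 shifts the balance towards boxes.
    print(localization_share(0.4, 0.2, localization_weight=1.3))
```

What the optimizer reacts to is the balance between the two terms, which is why raising one weight or lowering the other moves the model in the same direction.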
Your proficiency with the TensorFlow API has reached a new level. You’re now in full control of your experiments and know how to evaluate and compare them, so only the best will go to production!
You’re also familiar with how to move your model to production. You know how to export a model and have all code necessary to perform inference.
Hopefully, you now have a feeling of what your opportunities are for further model improvement. Give it a shot. You’ll love it when you see your metrics grow. Be creative with your hypothesis setting, and don’t be afraid to try new ideas. Maybe your next configuration will set a benchmark for all of us! Who knows?
See you next time!