Object detection locates and identifies objects in images, and it's one of the major accomplishments of deep learning and image processing. A common approach to localizing objects is with the help of bounding boxes. You can train an object detection model to identify and detect many different objects, which makes it highly versatile.
Object detection models are usually trained to detect the presence of specific objects. The constructed models can be used on images, videos, or real-time streams. Object detection attracted strong interest even before deep learning methodologies and modern-day image processing technologies arrived. Certain methods (like SIFT and HOG, with their feature and edge extraction techniques) had success with object detection, and they faced relatively few competitors in the field.
With the introduction of convolutional neural networks (CNNs) and the adoption of computer vision technologies, object detection became far more widespread. The new wave of deep learning approaches to object detection opens up seemingly endless possibilities.
Object detection makes use of the special and unique properties of each class to identify the required object. When looking for square shapes, an object detection model can look for perpendicular corners that form a square, with all four sides of equal length. When looking for a circular object, the model looks for central points from which the particular round entity can be constructed. Such identification techniques are used in facial recognition and object tracking.
In this article, we’re going to explore different object detection algorithms and libraries, but first, some basics.
Where is object detection used?
As we go about our daily lives, object detection is already all around us: for example, when your smartphone unlocks with face detection, or when video surveillance in stores and warehouses flags suspicious activity.
Here are several more major applications of object detection:
- Number plate recognition – using both object detection and optical character recognition (OCR) technology to recognize the alphanumeric characters on a vehicle. Object detection is used to detect vehicles and locate the number plate in a captured image. Once the model detects the number plate, OCR converts the two-dimensional image data into machine-encoded text.
- Face detection and recognition – as previously discussed, one of the major applications of object detection is face detection and recognition. With the help of modern algorithms, we can detect human faces in an image or video. It’s now even possible to recognize faces with just a single trained image due to one-shot learning methods.
- Object tracking – in a game of baseball or cricket, the ball may be hit far away. In these situations, it's useful to track the motion of the ball along with the distance it covers. Object tracking ensures that we have continuous information on the ball's direction of movement.
- Self-driving cars – for autonomous cars, it’s crucial to study the different elements around the car while driving. An object detection model trained on multiple classes to recognize the different entities becomes vital for the good performance of autonomous vehicles.
- Robotics – many tasks like lifting heavy loads, pick and place operations, and other real-time jobs are performed by robots. Object detection is essential for robots to detect things and automate tasks.
Object detection algorithms
Since the popularization of deep learning in the early 2010s, there's been continuous progress in the quality of algorithms used for object detection. We're going to explore the most popular ones, covering how they work, their benefits, and their flaws in certain scenarios.
1. Histogram of Oriented Gradients (HOG)
→ Introduction
The Histogram of Oriented Gradients is one of the oldest object detection methods, first introduced in 1986. Despite some developments in the following decades, the approach didn't gain much popularity until 2005, when it started being used in many computer vision tasks. HOG uses a feature descriptor to identify objects in an image.
The feature descriptor used in HOG represents a part of an image by extracting only the most necessary information and disregarding the rest. Its function is to convert an image patch into a compact array, or feature vector. In HOG, the gradient orientation procedure is used to localize the most critical parts of an image.
→ Overview of architecture

Before we get to the overall architecture of HOG, here's how it works. For each pixel in an image, the gradient is computed from the horizontal and vertical intensity differences of its neighbors, yielding a gradient magnitude and a gradient angle. Histograms of these gradients, accumulated over small regions, become the feature vectors.
We consider an image segment of a particular size. The first step is to compute the gradients by dividing the image into 8×8 pixel cells. From the 64 gradient vectors obtained in each cell, we split the gradient angles into angular bins and compute a histogram for that area. This process reduces the 64 vectors to a compact set of 9 values.
Once we have the 9-bin histogram for each cell, we can create overlapping blocks of cells. The final steps are to form the feature blocks, normalize the resulting feature vectors, and concatenate all feature vectors into one overall HOG descriptor. Check the following links for more information about this: [1] and [2].
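To make this concrete, here's a minimal sketch of HOG feature extraction using scikit-image's `hog` function on one of its built-in sample images; the 8×8 cells, 9 orientation bins, and block normalization mirror the steps described above.

```python
# A minimal HOG feature extraction sketch using scikit-image.
from skimage import color, data
from skimage.feature import hog

# Use a built-in sample image, converted to grayscale.
image = color.rgb2gray(data.astronaut())

# 8x8 pixel cells, 9 orientation bins, 2x2 cell blocks with
# L2-Hys normalization, as described in the steps above.
features, hog_image = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
    visualize=True,
)
print(features.shape)  # one long, flattened HOG feature vector
```

The resulting vector can then be fed to a classifier such as a linear SVM, which is the classic HOG + SVM detection pipeline.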
→ Achievements of HOG
- Creation of a feature descriptor useful for performing object detection.
- Ability to be combined with support vector machines (SVMs) to achieve high-accuracy object detection.
- Creation of a sliding window effect for the computation of each position.
→ Points to consider
- Limitations – While the Histogram of Oriented Gradients (HOG) was quite revolutionary in the early stages of object detection, the method has significant issues. The pixel-level computations are quite time-consuming for complex images, and the method is ineffective in certain object detection scenarios with tighter spaces.
- When to use HOG? – HOG is often best used as a first baseline method of object detection, against which other algorithms' performance can be tested. Regardless, HOG still finds significant use in object detection and facial landmark recognition with decent accuracy.
- Example use cases – One of the popular use cases of HOG is pedestrian detection, since the smooth edges of human silhouettes suit its gradient-based features. Other general applications include the detection of specific object types. For more information, refer to the following link.
2. Region-based Convolutional Neural Networks (R-CNN)
→ Introduction
The region-based convolutional neural network is an improvement on the previous object detection methods of HOG and SIFT. In R-CNN models, we extract the most essential regions (usually around 2000 region proposals) using selective search, an algorithm that computes the most significant regional proposals.
→ Working process of R-CNN

The selective search algorithm picks the most important regional proposals by generating multiple sub-segmentations of a particular image and selecting the candidate entries for the task. A greedy algorithm then recursively combines the effective smaller segments into suitable larger segments.
Once the selective search algorithm has completed, the next tasks are to extract the features and make the appropriate predictions. For each final candidate proposal, a convolutional neural network produces an n-dimensional (either 2048- or 4096-dimensional) feature vector as output. With a pre-trained convolutional neural network, the feature extraction task can be achieved with ease.
The final step of R-CNN is to make the appropriate predictions for the image and label the respective bounding boxes. To obtain the best results, the predictions are made with a classification model for each class, while a regression model refines the bounding box coordinates of the proposed regions. For further reading and information about this topic, refer to the following link.
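As an illustration, here's a short sketch of the selective search step that R-CNN relies on, using the implementation shipped with `opencv-contrib-python`; the image path is a placeholder.

```python
# Selective search region proposals with OpenCV's contrib module.
import cv2

image = cv2.imread("input.jpg")  # placeholder input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # trades some recall for speed

rects = ss.process()  # array of (x, y, w, h) region proposals
print(f"{len(rects)} region proposals generated")

# R-CNN would warp roughly the top ~2000 proposals and pass each
# crop through a CNN for feature extraction and classification.
for (x, y, w, h) in rects[:2000]:
    roi = image[y:y + h, x:x + w]  # candidate region to classify
```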
→ Issues with R-CNN
1. Despite producing effective results for feature extraction with pre-trained CNN models, the overall procedure of extracting all the region proposals, and ultimately the best regions, is extremely slow with the current algorithms.
2. Another major drawback of the R-CNN model is not only the slow training but also the high prediction time. The solution requires large computational resources, reducing the overall feasibility of the process. Hence, the architecture can be considered quite expensive.
3. Sometimes, bad candidates are selected at the initial step, and since no learning happens at that stage, it can't be improved. This can cause a lot of problems in the trained model.
→ Points to consider
- When To Use R-CNN? – R-CNN, like HOG, is best used as a first baseline for testing the performance of object detection models. Predictions on images can take a bit longer than anticipated, so the more modern versions of R-CNN are usually preferred.
- Example use cases – There are several applications of R-CNN for solving different types of tasks related to object detection. For example, tracking objects from a drone-mounted camera, locating text in an image, and enabling object detection in Google Lens. Check out the following link for more information.
3. Faster R-CNN
→ Introduction
While the R-CNN model was able to perform object detection and achieve desirable results, it had some major shortcomings, especially the speed of the model. Faster methods had to be introduced to overcome the problems that existed in R-CNN. First, Fast R-CNN was introduced to combat some of these pre-existing issues.
In the Fast R-CNN method, the entire image is passed through the pre-trained convolutional neural network instead of processing every sub-segment separately. Region of interest (RoI) pooling then takes two inputs, the feature map from the pre-trained model and the region proposals from the selective search algorithm, and produces fixed-size outputs for the fully connected layers. In this section, we'll learn more about the Faster R-CNN network, an improvement on the Fast R-CNN model.
→ Understanding Faster R-CNN

The Faster R-CNN model is one of the best versions of the R-CNN family and improves tremendously on the speed of its predecessors. While the R-CNN and Fast R-CNN models use a selective search algorithm to compute the region proposals, the Faster R-CNN method replaces it with a superior region proposal network. The region proposal network (RPN) processes images over a wide range of scales and aspect ratios to produce effective outputs.
The region proposal network reduces the marginal cost of computing proposals, usually to about 10 ms per image. It operates on the convolutional feature maps, from which we obtain the essential features for each position. At each feature map location, multiple anchor boxes of varying scales, sizes, and aspect ratios are placed. For each anchor box, the network predicts a binary objectness class and generates a bounding box.
This output is then passed through non-maximum suppression to remove redundant entries, since many overlapping proposals are produced over the feature maps. The output of non-maximum suppression is passed through the region of interest (RoI) pooling layer, and the rest of the computation is similar to Fast R-CNN.
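For a hands-on feel, here's a minimal inference sketch with torchvision's pre-trained Faster R-CNN (ResNet-50 + FPN backbone), not the paper's original implementation; `street.jpg` is a placeholder path.

```python
# Running a pre-trained Faster R-CNN from torchvision on one image.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("street.jpg").convert("RGB")  # placeholder path
tensor = F.to_tensor(image)  # CHW float tensor in [0, 1]

with torch.no_grad():
    prediction = model([tensor])[0]

# Each prediction holds boxes (x1, y1, x2, y2), class labels, and scores.
for box, label, score in zip(
    prediction["boxes"], prediction["labels"], prediction["scores"]
):
    if score > 0.8:
        print(label.item(), round(score.item(), 2), box.tolist())
```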
→ Points to consider
- Limitations – One of the main limitations of the Faster R-CNN method is the time delay before proposals for the different objects are generated. The speed also depends on the type of system being used.
- When To Use Faster R-CNN? – Prediction time is much faster than with the other CNN methods. While R-CNN usually takes around 40-50 seconds to predict the objects in an image, and Fast R-CNN around 2 seconds, Faster R-CNN returns results in just about 0.2 seconds.
- Example use cases – The examples of use cases for Faster R-CNN are similar to the ones described in the R-CNN methodology. However, with Faster R-CNN, we can perform these tasks optimally and achieve results more effectively.
4. Single Shot Detector (SSD)
→ Introduction
The single-shot detector for multi-box predictions is one of the fastest ways to achieve real-time object detection. While the Faster R-CNN methodologies can achieve high prediction accuracy, the overall process is quite time-consuming: it runs real-time tasks at only about 7 frames per second, which is far from desirable.
The single-shot detector (SSD) solves this issue by improving the frames per second to almost five times more than the Faster R-CNN model. It removes the use of the region proposal network and instead makes use of multi-scale features and default boxes.
→ Overview of architecture

The single-shot multibox detector architecture can be broken down into three main components. The first stage of the single-shot detector is feature extraction, where all the crucial feature maps are computed. This part of the architecture consists only of fully convolutional layers. After extracting all the essential feature maps, the next stage is the detection heads, which also consist of fully convolutional networks.
However, the task of the detection heads is not to find the semantic meaning of the image. Instead, their primary goal is to produce the most appropriate bounding boxes for all the feature maps. Once these two stages are computed, the final stage passes the output through non-maximum suppression layers to reduce the error rate caused by repeated bounding boxes.
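The non-maximum suppression stage mentioned above is easy to see in isolation; here's a small sketch using `torchvision.ops.nms` on hand-made boxes, where the first two boxes overlap heavily.

```python
# Non-maximum suppression on dummy candidate boxes.
import torch
from torchvision.ops import nms

# Three candidate boxes (x1, y1, x2, y2); the first two overlap heavily.
boxes = torch.tensor(
    [[10.0, 10.0, 60.0, 60.0],
     [12.0, 12.0, 62.0, 62.0],
     [100.0, 100.0, 160.0, 160.0]]
)
scores = torch.tensor([0.90, 0.80, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the weaker duplicate box is suppressed
```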
→ Limitations of SSD
- While SSD boosts performance significantly, it decreases the resolution of the input images to a lower quality.
- The SSD architecture will typically perform worse than the Faster R-CNN for small-scale objects.
→ Points to consider
- When To Use SSD? – The single-shot detector is often the preferred choice when we need faster predictions on an image, typically for detecting larger objects where accuracy isn't an extremely important concern. For more accurate predictions on smaller, finer objects, other methods should be considered.
- Example use cases – The Single-shot detector can be trained and experimented on a multitude of datasets, such as PASCAL VOC, COCO, and ILSVRC datasets. They can perform well on larger object detections like the detection of humans, tables, chairs, and other similar entities.
5. YOLO (You Only Look Once)
→ Introduction
You Only Look Once (YOLO) is one of the most popular model architectures and algorithms for object detection. It's usually the first concept you'll find in a Google search for object detection algorithms. There are several versions of YOLO, which we'll discuss in the upcoming sections. The YOLO model uses one of the best neural network architectures to produce high accuracy and overall speed of processing. This speed and accuracy is the main reason for its popularity.
→ Working process of YOLO

The YOLO architecture relies on three key techniques to achieve its goal of object detection. Understanding these three techniques explains why this model performs so quickly and accurately in comparison to other object detection algorithms. The first concept in the YOLO model is the grid division (often described as residual blocks): the first architectural design divides the image into a 7×7 grid.
Each of these grid cells acts as a central point, and a prediction is made for each of them. The second technique uses these central points to create the bounding boxes; while the classification task works well for each grid cell, it's more complex to segregate the bounding boxes for each of the predictions made. The third and final technique is intersection over union (IoU), used to determine the best bounding boxes for the particular object detection task.
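Intersection over union itself is simple enough to write by hand; here's a self-contained sketch for two axis-aligned boxes given as (x1, y1, x2, y2).

```python
# Intersection over union (IoU) for two axis-aligned boxes.
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Heavily overlapping boxes score close to 1; disjoint boxes score 0.
print(iou((10, 10, 60, 60), (12, 12, 62, 62)))  # ~0.85
```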
→ Advantages of YOLO
- The computation and processing speed of YOLO is quite high, especially in real-time compared to most of the other training methods and object detection algorithms.
- Apart from the fast computing speed, the YOLO algorithm also manages to provide an overall high accuracy with the reduction of background errors seen in other methods.
- The architecture of YOLO allows the model to learn and develop an understanding of numerous objects more efficiently.
→ Limitations of YOLO
- Failure to detect smaller objects in an image or video because of the lower recall rate.
- Can't detect two objects that are extremely close to each other, due to the limitations of bounding boxes.
→ Versions of YOLO
The YOLO architecture is one of the most influential and successful object detection algorithms. After the introduction of the YOLO architecture in 2016, its successive versions YOLO v2 and YOLO v3 arrived in 2017 and 2018. While there was no new release in 2019, 2020 saw three quick releases: YOLO v4, YOLO v5, and PP-YOLO. Each newer version of YOLO slightly improved on its predecessors. Tiny YOLO was also released so that object detection could be supported on embedded devices.
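As a quick taste of the newer releases, here's a sketch that loads YOLOv5 through `torch.hub` (the first call downloads the ultralytics repository and weights); the sample image URL comes from the ultralytics docs.

```python
# Loading and running YOLOv5 via torch.hub.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Accepts file paths, URLs, PIL images, or numpy arrays.
results = model("https://ultralytics.com/images/zidane.jpg")
results.print()          # summary of detected classes and confidences
print(results.xyxy[0])   # per-detection (x1, y1, x2, y2, conf, class)
```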
→ Points to consider
- When To Use YOLO? – While all the previously discussed methods perform quite well on images and sometimes video analysis for object detection, the YOLO architecture is one of the most preferred methods for performing object detection in real-time. It achieves high accuracy on most real-time processing tasks with a decent speed and frames per second depending on the device that you’re running the program on.
- Example use cases – Some popular use cases of the YOLO architecture apart from object detection on numerous objects include vehicle detection, animal detection, and person detection. For further information, refer to the following link.
6. RetinaNet
→ Introduction
The RetinaNet model, introduced in 2017, became one of the best single-shot object detection models, surpassing other popular object detection algorithms of the time. On release, its object detection capabilities exceeded those of YOLO v2 and SSD while maintaining the same speed, and it could also compete with the R-CNN family in terms of accuracy. For these reasons, the RetinaNet model finds high usage in detecting objects in satellite imagery.
→ Overview of architecture

The RetinaNet architecture is built in such a way that the previous issues of single-shot detectors are balanced out, producing more effective and efficient results. In this model architecture, the cross-entropy loss of previous models is replaced with the focal loss, which handles the class imbalance problems that exist in architectures like YOLO and SSD. The RetinaNet model is a combination of three main entities.
RetinaNet is built from three components, namely the ResNet backbone (specifically ResNet-101), the feature pyramid network (FPN), and the focal loss. The feature pyramid network overcomes a majority of the shortcomings of previous architectures by combining the semantically rich features of low-resolution feature maps with the semantically weak features of high-resolution ones.
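For reference, here's a minimal PyTorch sketch of the binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); torchvision ships an equivalent `sigmoid_focal_loss`, so this is purely illustrative.

```python
# Binary focal loss: down-weights easy examples via (1 - p_t)^gamma.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-element binary cross-entropy, then the focal modulation.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.5, -1.0, 0.3])   # raw predictions
targets = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
print(focal_loss(logits, targets))
```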
For the final output, both classification and regression sub-networks are built, similar to the other object detection methods discussed previously. The classification network makes the appropriate multi-class predictions, while the regression network predicts the appropriate bounding boxes for the classified entities. For further information and reading on this topic, check out the article or video guides from the following links, [1] and [2].
→ Points to consider
- When to use RetinaNet? – RetinaNet is currently one of the best methods for object detection in a number of different tasks. It can be used as a replacement for a single-shot detector for a multitude of tasks to achieve quick and accurate results for images.
- Example use cases – There's a wide array of applications for the RetinaNet object detection algorithm. A high-profile application of RetinaNet is object detection in aerial and satellite imagery.
Object detection libraries
1. ImageAI
→ Introduction
The ImageAI library aims to provide developers with a multitude of computer vision algorithms and deep learning methodologies for completing object detection and image processing tasks. Its primary objective is to let developers code object detection projects with only a few lines of code.
For further information, make sure to visit the official documentation of the ImageAI library from the following link. Most of ImageAI's code is written in Python, originally on top of the popular deep learning framework TensorFlow. As of June 2021, the library uses a PyTorch backend for its image processing computations.
→ Overview
The ImageAI library supports a ton of operations related to object detection, namely image recognition, image object detection, video object detection, video detection analysis, custom image recognition training and inference, and custom object detection training and inference. The image recognition functionality can recognize up to 1000 different objects in an image.
Image and video object detection covers 80 of the most common objects seen in daily life. Video detection analysis computes a timely analysis of any particular object detected in a video or in real time. It's also possible to introduce custom images for training your own samples: with new images and datasets, you can train the detector on many more objects.
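A typical detection script looks roughly like the sketch below, assuming a RetinaNet weights file has already been downloaded; the file paths are placeholders.

```python
# Object detection with ImageAI's RetinaNet detector.
from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsRetinaNet()
detector.setModelPath("retinanet_model.pth")  # placeholder weights path
detector.loadModel()

detections = detector.detectObjectsFromImage(
    input_image="input.jpg",          # placeholder input
    output_image_path="output.jpg",   # annotated copy written here
    minimum_percentage_probability=50,
)
for det in detections:
    print(det["name"], det["percentage_probability"], det["box_points"])
```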
→ GitHub Reference
For further information and reading on the ImageAI library, refer to the following GitHub Reference.
2. GluonCV
→ Introduction
GluonCV is one of the best library frameworks, with state-of-the-art implementations of deep learning algorithms for various computer vision applications. Its primary objective is to help enthusiasts in this field achieve productive results in a shorter time period. It offers a large set of training datasets, implementation techniques, and carefully designed APIs.
→ Overview
The GluonCV library framework supports an extensive number of tasks: image classification; object detection on images, video, or in real time; semantic and instance segmentation; pose estimation to determine the pose of a particular body; and action recognition to detect the type of human activity being performed. These features make it one of the best object detection libraries for achieving quicker results.
The framework provides all the state-of-the-art techniques required for performing the previously mentioned tasks. It supports both MXNet and PyTorch and comes with a wide array of tutorials and additional support from which you can start exploring numerous concepts. It also contains a large number of pre-trained models from which you can pick and build a machine learning model of your choice for a specific task.
With either MXNet or PyTorch installed in your virtual environment, you can follow this link to get started with the simple installation of this object detection library, choosing the specific setup you need. It also gives you access to the Model Zoo, one of the best platforms for easy deployment of machine learning models. All these features make GluonCV a great object detection library.
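As a small sketch of the workflow (using the MXNet backend and a placeholder image), running a pre-trained SSD detector from the Model Zoo takes only a few lines:

```python
# Running a pre-trained SSD model from GluonCV's Model Zoo.
from gluoncv import data, model_zoo

net = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)

# The preset transform resizes and normalizes the image for SSD.
x, img = data.transforms.presets.ssd.load_test("street.jpg", short=512)

class_ids, scores, bounding_boxes = net(x)
print(class_ids.shape, scores.shape, bounding_boxes.shape)
```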
→ GitHub Reference
For further information and reading on this library, check out the following GitHub Reference.
3. Detectron2
→ Introduction
The Detectron2 framework, developed by Facebook's AI Research (FAIR) team, is considered a next-generation library that supports most state-of-the-art detection techniques, object detection methods, and segmentation algorithms. Detectron2 is a PyTorch-based object detection framework. The library is highly flexible and extensible, providing users with multiple high-quality implementations of algorithms and techniques. It also supports numerous applications and production projects at Facebook.
→ Overview
The Detectron2 library, developed on PyTorch by Facebook, has tremendous applications and can be trained on single or multiple GPUs to produce fast and effective results. With the help of this library, you can implement several high-quality object detection algorithms to achieve the best results. The state-of-the-art technologies and object detection algorithms supported by the library include DensePose, panoptic feature pyramid networks, and numerous other variations of the pioneering Mask R-CNN model family. [1]
The Detectron2 library also allows users to train custom models and datasets with ease. The installation procedure is quite simple: the only dependencies you need are PyTorch and the COCO API. Once you have these, you can install Detectron2 and train a multitude of models with ease. To learn more about how to use the library, follow this guide.
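Here's a minimal inference sketch with a COCO-pretrained Faster R-CNN config from Detectron2's model zoo; the image path is a placeholder.

```python
# Detectron2 inference with a model zoo Faster R-CNN config.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence cutoff

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # placeholder image
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)
```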
→ GitHub Reference
For further information and reading on this library, check out the following GitHub Reference.
4. YOLOv3_TensorFlow
→ Introduction
The YOLO v3 model, released in 2018, is one of the successful implementations of the YOLO series. The third version improves on the previous models in terms of both speed and accuracy. Unlike the other architectures, it also performs decently on smaller objects with good precision. The main remaining concern, compared to other major algorithms, is the tradeoff between speed and accuracy.
→ Overview
The YOLOv3_TensorFlow library is one of the earliest implementations of the YOLO v3 architecture for object detection processing and computing. It provides extremely fast GPU computations, effective results and data pipelines, weight conversions, faster training times, and a lot more. While the library can be obtained from the link provided in the next section, support for this TensorFlow framework has stopped (as with most similar projects), and it's now maintained with PyTorch instead.
→ GitHub Reference
For further information and reading on YOLO, refer to the following GitHub Reference.
5. Darkflow
→ Introduction
Darkflow is inspired by the darknet framework; it's essentially a translation of darknet to Python and TensorFlow, making it accessible to a wider audience. Darknet is an early object detection library written in C and CUDA. Darkflow is quite simple to install and use, and the framework supports both CPU and GPU computation of object detection tasks, to achieve the best results in either scenario.
→ Overview
The Darkflow framework has a few basic requirements: Python 3, TensorFlow, NumPy, and OpenCV. With these dependencies, you can start computing object detection tasks with ease. Darkflow has access to the YOLO models, and you can download custom weights for a variety of models.
Some of the tasks that the Darkflow library helps you accomplish include parsing annotations, designing a network from a specific configuration, plotting training graphs with flow, training a new model, training on a custom dataset, running detection on real-time streams or video files, and using the framework for other similar applications. Finally, it also lets you save these models in the protobuf (.pb) format.
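A typical Darkflow session looks roughly like the sketch below, assuming the YOLO config and weights files have been downloaded into the repository's cfg/ and bin/ folders.

```python
# Object detection with Darkflow's Python API.
import cv2
from darkflow.net.build import TFNet

options = {
    "model": "cfg/yolo.cfg",     # network configuration
    "load": "bin/yolo.weights",  # pre-trained darknet weights
    "threshold": 0.4,            # confidence cutoff
}
tfnet = TFNet(options)

image = cv2.imread("input.jpg")  # placeholder input image
predictions = tfnet.return_predict(image)
# Each prediction: {"label", "confidence", "topleft", "bottomright"}
print(predictions)
```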
→ GitHub Reference
For further information and reading, refer to the following GitHub Reference.
Conclusion
Object detection is still one of the most essential deep learning and computer vision applications to date. We’ve seen a lot of improvements and advancements in the methodologies of object detection.
It started with algorithms like the Histogram of Oriented Gradients, introduced way back in 1986 to perform simple object detection on images with decent accuracy. Now, we have modern architectures such as Faster R-CNN, Mask R-CNN, YOLO, and RetinaNet.
Object detection is not restricted to images; it can be performed effectively on videos and real-time footage with high accuracy. In the future, many more successful algorithms and libraries for object detection await us.