Workflow of Object Detection
Whether we build a custom object detection model or use a pretrained one, we need to decide which type of object detection network to use:
- Two-Stage Network
- Single-Stage Network
Two-Stage Networks
In a two-stage network, prediction happens in two stages. In stage 1, the model uses a CNN (or one of its variants) to identify region proposals: subsets of the image that may contain an object. In stage 2, the model uses classification and regression to classify each region proposal and refine its boundaries.
- Input Preprocessing: In this stage, input images are resized to a fixed size and pixel values are normalized. Data augmentation techniques such as rotation, flipping, or brightness adjustment may also be applied to enrich the dataset with more variation.
- Feature Extraction: The processed image is then passed to a pretrained CNN backbone, which extracts relevant features. This CNN has two types of layers: convolutional layers, which detect edges, shapes, and patterns in the image, and pooling layers, which reduce spatial dimensions and capture invariant features.
- Region Proposal: In this step, the model predicts candidate regions (regions of interest that may contain an object) with the help of a Region Proposal Network (RPN).
- RoI Pooling or RoI Align: After identifying candidate regions, each candidate region is extracted from the feature maps and resized to a fixed size using RoI pooling or RoI align, ensuring that features for each region are spatially aligned.
- Classification and Regression Head: Afterwards, the pooled region features are fed into separate classification and regression heads. The classification head assigns a class label to each candidate region, and the regression head refines each region's bounding-box coordinates (x, y, width, and height).
- Non-Maximum Suppression (NMS): To eliminate redundant detections, NMS is applied to filter out duplicate predictions with high overlap.
- Output Prediction: The final output is a collection of bounding boxes representing the detected objects in the image, along with the corresponding class labels.
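The NMS step above can be sketched in a few lines of NumPy. This is a simplified, illustrative implementation of the greedy algorithm; real pipelines typically call a library routine such as TensorFlow's `tf.image.non_max_suppression` instead:

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-Union between one box and an array of boxes.
    Boxes are in (x1, y1, x2, y2) corner format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    discard remaining boxes that overlap it too strongly."""
    order = np.argsort(scores)[::-1]             # indices by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop high-overlap duplicates
    return keep

# Two near-duplicate detections of the same object plus one distinct box
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the 0.8 box overlaps the 0.9 box and is suppressed
```

The greedy loop is why NMS can be slow with thousands of raw detections; in practice a confidence threshold is usually applied first to shrink the candidate set.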
Post-processing:
If required, additional post-processing steps might be applied, such as applying thresholds to confidence scores or filtering out predictions based on specific criteria.
Note: These models are slower than single-stage networks but can achieve higher accuracy.
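The input-preprocessing step, which both network types share, can be sketched with NumPy. This is a minimal illustration: the nearest-neighbor resize and the 0–255 → 0–1 scaling are just one common set of choices, and a real pipeline would typically use a library routine such as `tf.image.resize`:

```python
import numpy as np

def preprocess(image, target_size=(224, 224)):
    """Resize an image (nearest-neighbor) and normalize pixels to [0, 1]."""
    h, w = image.shape[:2]
    th, tw = target_size
    rows = np.arange(th) * h // th   # source row for each target row
    cols = np.arange(tw) * w // tw   # source column for each target column
    resized = image[rows[:, None], cols]
    return resized.astype(np.float32) / 255.0

def augment_flip(image):
    """Simple data augmentation: horizontal flip."""
    return image[:, ::-1]

img = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
x = preprocess(img)
print(x.shape, x.min() >= 0.0, x.max() <= 1.0)  # (224, 224, 3) True True
```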
Single-Stage Networks
Unlike two-stage networks, which rely on an external region proposal network (RPN) or similar method to generate candidate regions, single-stage models use a predefined set of anchor boxes; the network's predictions are decoded relative to these anchors to produce the final bounding boxes for the objects.
- Input Preprocessing: The input image is resized to a fixed size and pixel values are normalized. Some models also employ data augmentation techniques to enhance the dataset.
- Feature Extraction: The processed image is then passed to a pretrained CNN backbone, which extracts relevant features. As in the two-stage case, convolutional layers detect edges, shapes, and patterns, while pooling layers reduce spatial dimensions and capture invariant features.
- Anchor Boxes/Default Boxes: To predict the candidate regions, instead of relying on external region proposal networks (RPNs) or other methods to generate candidate regions, these models use a predefined set of anchor boxes or default boxes. These boxes are designed to cover a range of object sizes and aspect ratios.
- Bounding Box Prediction: For each anchor box, the network predicts bounding box coordinates (x, y, width, and height) and a confidence score. The confidence score represents the model's confidence that an object is present within the anchor box.
- Non-Maximum Suppression (NMS): To eliminate redundant detections, NMS is applied to filter out duplicate predictions with high overlap.
- Output Prediction: The output consists of a collection of bounding boxes representing the detected objects in the image, along with the corresponding class labels.
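The anchor-box mechanism above can be illustrated with a small NumPy sketch. This is hypothetical code: the offset parameterization, where the network predicts (dx, dy, dw, dh) relative to each anchor, follows a common convention used by detectors in the SSD family, but the exact grid sizes, scales, and decoding formulas vary from model to model:

```python
import numpy as np

def make_anchors(grid_size, scales, ratios):
    """Generate center-form anchors (cx, cy, w, h) on a regular grid,
    covering a range of object sizes (scales) and aspect ratios."""
    step = 1.0 / grid_size                      # normalized image coordinates
    centers = (np.arange(grid_size) + 0.5) * step
    anchors = []
    for cy in centers:
        for cx in centers:
            for s in scales:
                for r in ratios:
                    anchors.append([cx, cy, s * np.sqrt(r), s / np.sqrt(r)])
    return np.array(anchors)

def decode(anchors, offsets):
    """Decode predicted offsets (dx, dy, dw, dh) into boxes (cx, cy, w, h)."""
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]   # shift center by dx * w
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2])            # scale width by e^dw
    h = anchors[:, 3] * np.exp(offsets[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

anchors = make_anchors(grid_size=4, scales=[0.1, 0.2], ratios=[1.0, 2.0])
print(anchors.shape)  # (64, 4): 4x4 grid x 2 scales x 2 ratios
offsets = np.zeros_like(anchors)   # zero offsets decode back to the anchors themselves
assert np.allclose(decode(anchors, offsets), anchors)
```

Because the anchors are fixed in advance, the network only has to learn small corrections to them, which is what makes the single forward pass of these models sufficient.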
Post-processing:
If needed, additional post-processing steps may be applied, such as thresholding confidence scores or filtering out predictions based on specific criteria.
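Confidence-score thresholding, the most common of these post-processing steps, amounts to a simple boolean filter. The sketch below is illustrative, and the 0.5 threshold is just an example value; in practice it is tuned per application:

```python
import numpy as np

def filter_by_confidence(boxes, scores, labels, threshold=0.5):
    """Keep only detections whose confidence score meets the threshold."""
    keep = scores >= threshold
    return boxes[keep], scores[keep], labels[keep]

boxes = np.array([[10, 10, 50, 50], [60, 60, 90, 90], [5, 5, 20, 20]])
scores = np.array([0.92, 0.31, 0.75])
labels = np.array([1, 2, 1])
b, s, l = filter_by_confidence(boxes, scores, labels)
print(len(b))  # 2 detections survive the 0.5 threshold
```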
Note: Single-stage networks are faster than two-stage networks but may not achieve the same precision, especially on images containing small objects.
Real-Time Object Detection Using TensorFlow
In November 2015, Google's deep-learning research team released TensorFlow, a machine learning library initially designed for internal use. This open-source library helped researchers and developers build, train, and deploy machine learning models, making the implementation of machine learning algorithms and deep learning applications, including image recognition, voice search, and object detection, far more accessible. In this article, we will delve into object detection using TensorFlow.
Table of Contents
- What is Object Detection?
- Approaches to build Object Detection Model
- Workflow of Object Detection
- Object Detection Using Tensorflow