Fast R-CNN | ML

Before discussing Fast R-CNN, let’s look at the challenges faced by R-CNN.

  • Training R-CNN is very slow because each part of the model (the CNN, the SVM classifier, and the bounding-box regressor) has to be trained separately, and these stages cannot be parallelized.
  • In R-CNN, every region proposal (up to ~2000 per image) must be forwarded through the deep convolutional architecture. That explains how long it takes to train this model.
  • Inference time is also very high: R-CNN takes about 49 seconds to test a single image (including selective-search region proposal generation).

Fast R-CNN works to solve these problems. Let’s look at the architecture of Fast R-CNN.

Fast R-CNN architecture

First, region proposals are generated with the selective search algorithm, which produces up to roughly 2000 proposals per image. The whole input image is passed through a CNN once, which outputs a convolutional feature map, and each region proposal is projected onto this feature map (the RoI projection). For each proposal, a Region of Interest (RoI) pooling layer then extracts a fixed-length feature vector from the feature map. Every feature vector is finally passed into two sibling layers, a softmax classifier and a bounding-box regressor, which classify the region proposal and refine the position of the object's bounding box.
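To make the data flow concrete, here is a minimal PyTorch-style sketch of that forward pass. The class name `FastRCNNHead`, the 7×7 pooled size, and the `spatial_scale=1/16` stride are illustrative assumptions (not taken from the original article); the RoI pooling step uses `torchvision.ops.roi_pool`.

```python
# Minimal sketch of the Fast R-CNN forward pass, assuming a VGG-16-style
# backbone with stride 16; names and sizes here are illustrative, not canonical.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    def __init__(self, backbone, num_classes, pooled_size=7, spatial_scale=1.0 / 16):
        super().__init__()
        self.backbone = backbone              # conv layers that output a feature map
        self.pooled_size = pooled_size
        self.spatial_scale = spatial_scale    # 1 / backbone stride (1/16 for VGG-16)
        in_features = 512 * pooled_size * pooled_size
        self.fc = nn.Sequential(
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # softmax classification logits
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # per-class box refinements

    def forward(self, images, rois):
        # rois: list with one (num_boxes, 4) tensor of proposals per image,
        # given in input-image coordinates (x1, y1, x2, y2).
        feature_map = self.backbone(images)   # one conv pass for the whole image batch
        pooled = roi_pool(
            feature_map, rois,
            output_size=(self.pooled_size, self.pooled_size),
            spatial_scale=self.spatial_scale,
        )                                      # fixed-size feature per region proposal
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```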

CNN Network of Fast R-CNN

Fast R-CNN was evaluated with three ImageNet pre-trained networks, each with five max-pooling layers and between five and thirteen convolutional layers (such as VGG-16). Several changes are made to the chosen pre-trained network. These changes are:...
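In the Fast R-CNN paper, the main modifications are replacing the last max-pooling layer with an RoI pooling layer, replacing the final fully connected layer and softmax with the two sibling output layers, and having the network take both an image and a list of RoIs as input. The sketch below is a rough illustration of this adaptation on torchvision's pre-trained VGG-16; the 21-class count is an assumption (20 PASCAL VOC classes plus background), not something stated in this article.

```python
# Sketch of adapting a pre-trained VGG-16 for Fast R-CNN; layer slicing
# follows torchvision's VGG-16 definition, and 21 classes is an assumption.
import torch.nn as nn
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Drop the final max-pooling layer; an RoI pooling layer takes its place.
backbone = nn.Sequential(*list(vgg.features.children())[:-1])

# Keep the first two fully connected layers (fc6, fc7) as the per-RoI head,
# dropping the 1000-way ImageNet classification layer.
roi_head = nn.Sequential(*list(vgg.classifier.children())[:-1])

# Replace the ImageNet classifier with two sibling output layers.
num_classes = 21
cls_score = nn.Linear(4096, num_classes)        # softmax classification head
bbox_pred = nn.Linear(4096, num_classes * 4)    # per-class bounding-box regression head
```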

Region of Interest (RoI) pooling

(Source: Fast R-CNN slides)...
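RoI pooling divides each region proposal (projected onto the feature map) into a fixed H × W grid and max-pools the values inside each grid cell, so proposals of any size produce a feature of the same shape. The toy sketch below mimics this with `adaptive_max_pool2d` over the cropped feature-map region; the tensor shapes and the 7 × 7 output size are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def roi_pool_single(feature_map, roi, output_size=(7, 7)):
    """Toy RoI pooling for one RoI on one feature map.

    feature_map: (C, H, W) convolutional feature map
    roi:         (x1, y1, x2, y2) box in feature-map coordinates
    """
    x1, y1, x2, y2 = [int(round(c)) for c in roi]
    # Crop the part of the feature map covered by the RoI
    # (keep at least one cell in each direction to avoid an empty crop).
    region = feature_map[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    # Max-pool the crop down to a fixed grid, independent of the RoI size.
    return F.adaptive_max_pool2d(region, output_size)

# Example: a 512-channel feature map and one RoI
fmap = torch.randn(512, 38, 50)
pooled = roi_pool_single(fmap, (10.0, 5.0, 30.0, 20.0))
print(pooled.shape)  # torch.Size([512, 7, 7])
```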

Training and Loss Function

Each training region of interest is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. The two outputs produced for that RoI, the softmax classifier's probability distribution p over the K + 1 classes and the bounding-box regressor's per-class offsets t^u, are fed into a loss function that accounts for both classification and bounding-box localization. This loss function is called the multi-task loss and is defined as

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\, [u \ge 1]\, L_{loc}(t^u, v)$$

where Lcls is the classification loss and Lloc is the localization loss. λ is a balancing parameter, and [u ≥ 1] is an indicator that equals 0 for the background class (u = 0) and 1 otherwise, so the localization term is only computed for RoIs that actually contain an object with a ground-truth bounding box. Here, Lcls is the log loss for the true class, Lcls(p, u) = −log p_u, and Lloc is the smooth L1 loss over the four box coordinates:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L_1}(t_i^u - v_i), \qquad \operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
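A minimal PyTorch-style sketch of this multi-task loss follows. The function name, tensor layout, and `lambda_` default are illustrative assumptions; `F.cross_entropy` supplies the log loss and `F.smooth_l1_loss` (with its default beta of 1) matches the smooth L1 definition above, applied only to foreground RoIs.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(cls_logits, bbox_deltas, labels, bbox_targets, lambda_=1.0):
    """Sketch of the Fast R-CNN multi-task loss.

    cls_logits:   (N, K+1) classification scores for N RoIs
    bbox_deltas:  (N, (K+1)*4) per-class box regression outputs
    labels:       (N,) ground-truth class u (0 = background)
    bbox_targets: (N, 4) ground-truth regression targets v
    """
    # L_cls: log loss for the true class u (cross-entropy = -log p_u)
    loss_cls = F.cross_entropy(cls_logits, labels)

    # Select the 4 box outputs belonging to each RoI's ground-truth class u.
    n = cls_logits.shape[0]
    deltas = bbox_deltas.view(n, -1, 4)[torch.arange(n), labels]

    # [u >= 1] indicator: localization loss only for foreground RoIs.
    fg = labels > 0
    if fg.any():
        # Default beta=1 gives 0.5*x^2 for |x| < 1 and |x| - 0.5 otherwise.
        loss_loc = F.smooth_l1_loss(deltas[fg], bbox_targets[fg])
    else:
        loss_loc = cls_logits.new_zeros(())

    return loss_cls + lambda_ * loss_loc
```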

Results and Conclusion

Fast R-CNN provided state-of-the-art mAPs on the VOC 2007, 2010, and 2012 datasets. It also considerably reduces training time (9.5 hours vs. 84 hours for R-CNN) and per-image detection time (0.32 seconds vs. 47 seconds)...

Advantages of Fast R-CNN over R-CNN

The most important reason Fast R-CNN is faster than R-CNN is that the ~2000 region proposals of an image do not each have to be passed through the CNN. Instead, the convolution operation is performed only once per image, and a feature map is generated from it. Because the whole model is combined and trained end-to-end in one go, there is also no need to cache features, which reduces the disk storage required during training. Fast R-CNN also improves mAP compared to R-CNN on most classes of the VOC 2007, 2010, and 2012 datasets...