20th September 2020 • Joel Tan
This is an introduction to the R-CNN family of models, used frequently in object detection computer vision tasks. Readers are assumed to be familiar with traditional CNNs (Convolutional Neural Networks), image classification, and the idea of using CNNs as feature extractors for images.
Unlike image classification, where each image contains exactly one object and the goal is simply to assign that object to one of several predefined classes, object detection is a far more difficult problem: we must both localize and classify any number of objects present in an image. As such, traditional CNN architectures for image classification, which assume a fixed number of outputs (typically just one), are not immediately applicable to detection problems, since the number of objects varies from image to image. We therefore have specialized algorithms and methods for dealing with detection problems, with R-CNNs being one of the most popular classes of such algorithms.
Note that we typically use the mean average precision (mAP) metric to evaluate the performance of an object detection model: roughly speaking, it averages the per-class average precision, where a predicted box counts as a true positive only if it overlaps a ground-truth box sufficiently. I recommend reading this article to have a clearer understanding of what this metric means.
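The overlap measure used almost everywhere in detection is intersection-over-union (IoU). As a minimal sketch, IoU for two axis-aligned boxes in (x1, y1, x2, y2) format can be computed as follows:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```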
As a whole, the vanilla R-CNN (‘Regions with CNN Features’), introduced in this paper by Ross Girshick et al. in 2013, uses a 2-stage mechanism: we first propose a large number (in the original paper, around 2000) of rectangular subregions likely to contain interesting features within the image, otherwise known as regions of interest (ROIs). Afterwards, we apply a CNN model to each of these ROIs to extract CNN features for it. Using these CNN features, we can then apply any regular classification model to determine the class of the object in each region.
In fact, this 2-stage idea of splitting our original image into smaller subregions followed by applying a classification model on each of these subregions is not novel, and had been a popular strategy even before R-CNNs.
The most basic method to generate smaller subregions of the original image is the sliding window method: exhaustively search the entire image by sliding a window of fixed size and aspect ratio across it, each time taking the pixels within the window as a candidate subregion. This naive method is easy to implement and requires no model training or computer vision techniques, but it suffers from several big issues: it cannot deal with objects of varying scales and aspect ratios (due, for example, to different camera angles and distances), and it is computationally inefficient, as a large amount of time is wasted on unimportant regions with nothing of interest within them. A minimal sketch of the method follows below.
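For illustration, here is a single-scale sliding window generator; the window size and stride are arbitrary choices, and handling multiple scales and aspect ratios would require rerunning it over an image pyramid, which is exactly where the inefficiency comes from:

```python
def sliding_windows(image, window=(128, 128), stride=32):
    """Yield (x, y, crop) for every position of a fixed-size window.
    image is an (H, W, C) array; window and stride are in pixels."""
    win_w, win_h = window
    h, w = image.shape[:2]
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]
```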
Instead, we can do better by using a class of algorithms known as region proposal algorithms, which propose a much smaller set of subregions likely to contain interesting information (e.g. objects) in the image. R-CNN uses one such algorithm, called Selective Search (from this paper in 2012), which uses traditional (non-deep learning) computer vision techniques to select several regions of interest, along with an ‘objectness’ score for each: a measure of how likely it is that an object exists within the region. This article goes into greater detail on the Selective Search algorithm, and I recommend reading it for more details.
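For readers who want to experiment, OpenCV's contrib package ships a Selective Search implementation; a minimal usage sketch (the input file name is a placeholder):

```python
import cv2  # requires the opencv-contrib-python package for the ximgproc module

image = cv2.imread("input.jpg")  # placeholder file name
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # the slower 'Quality' mode trades speed for recall
rects = ss.process()              # array of proposals as (x, y, w, h)
rois = rects[:2000]               # R-CNN keeps on the order of 2000 proposals
```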
As compared to its predecessor region proposal algorithms, Selective Search was preferred due to its much higher recall (i.e. regions containing objects are very unlikely to be missed by the algorithm) and its relatively faster speed, though it is still too slow for real-time purposes.
The R-CNN architecture can then be described as follows: We first apply Selective Search on the input image to obtain around 2000 ROIs. For each of these regions, we warp the region to the fixed input size expected by the CNN, then pass it through a pre-trained CNN to extract a 4096-dimensional feature vector. These features are then fed separately into two different models: an SVM for classifying the object (there is also an additional catch-all class, ‘background’, to allow for the prediction that no object is present), as well as a standard regression model to more accurately predict the actual bounding box of the object.
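To make this concrete, here is a hedged PyTorch sketch of the per-ROI feature extraction step, assuming an AlexNet backbone (as in the original paper) and ROIs given as integer pixel coordinates; the name roi_features is illustrative only:

```python
import torch
import torch.nn.functional as F
from torchvision.models import alexnet

# Pre-trained AlexNet with its final classification layer removed,
# so that it outputs 4096-dimensional feature vectors.
cnn = alexnet(weights="IMAGENET1K_V1")
cnn.classifier = cnn.classifier[:-1]
cnn.eval()

def roi_features(image, rois):
    """image: float tensor (3, H, W); rois: list of integer (x1, y1, x2, y2)."""
    crops = []
    for x1, y1, x2, y2 in rois:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        # Warp each region to the fixed input size the CNN expects.
        crops.append(F.interpolate(crop, size=(224, 224), mode="bilinear"))
    with torch.no_grad():
        return cnn(torch.cat(crops))  # shape: (num_rois, 4096)
```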
Note that the CNN feature extraction model, bounding box regression model, and the SVM model are shared among the different regions.
The output of this procedure is then a list of 2000 bounding boxes with SVM scores; after filtering out duplicate, highly overlapping detections with non-maximum suppression (NMS), we obtain the final predicted bounding boxes and classes of the objects.
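Here is a minimal sketch of greedy NMS, reusing the iou helper from earlier; the 0.3 threshold is an illustrative choice:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    discard all boxes that overlap it by more than iou_threshold."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```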
Typically, the CNN feature extractor is a pre-trained model, while the SVM and bounding box regression models are trained with standard supervised learning on labelled data: the CNN-extracted features serve as inputs, and the respective ground-truth classes and bounding boxes as targets.
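Concretely, the regression model in the paper does not predict box coordinates directly: it learns scale-invariant offsets from each proposal to its matched ground-truth box. A sketch of the target computation, assuming boxes in (center_x, center_y, width, height) format:

```python
import numpy as np

def regression_targets(proposal, ground_truth):
    """R-CNN bounding-box regression targets for one matched pair.
    Both boxes are given as (center_x, center_y, width, height)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    # Translation of the centre, normalized by the proposal's size,
    # and log-space scaling of the width and height.
    return ((gx - px) / pw, (gy - py) / ph,
            np.log(gw / pw), np.log(gh / ph))
```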
While this vanilla R-CNN proved to be a breakthrough at the time in terms of mAP scores on object detection problems, the model is still rather computationally expensive, and suffers from very slow training and evaluation. The major speed bottleneck comes from having to run the CNN on each of the 2000 ROIs per image, as well as from having three separate models (CNN, SVM, bounding box regression) whose training and evaluation computations are mostly independent and not shared.
Since then, there have been several profound research breakthroughs to tackle these major speed deficiencies.
Fast R-CNN (this paper, by Ross Girshick in 2015) introduces a simple yet profound change: by flipping the order of Selective Search and CNN feature extraction, we only need to apply the CNN feature extractor once (instead of 2000 times) for each image, cutting down massively on the number of computations. Since the proposed regions then correspond to patches of varying sizes on the shared feature map, Fast R-CNN adds an RoI pooling layer, which pools each region's features down to a fixed spatial size.
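A small sketch of RoI pooling using torchvision's built-in op, assuming a backbone with a total stride of 16 so that image coordinates map onto the feature map via spatial_scale=1/16:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 32, 32)  # CNN features for one 512x512 image
# Two proposals in (x1, y1, x2, y2), in the original image's pixel coordinates.
rois = [torch.tensor([[  0.0,  0.0, 256.0, 256.0],
                      [128.0, 64.0, 400.0, 300.0]])]
# spatial_scale maps image coordinates onto the 16x-downsampled feature map.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```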
Furthermore, the SVM and bounding box regression models are replaced by two sibling output layers of the same network: a softmax classifier and a set of bounding-box regressors, trained jointly with a multi-task loss. This allows for shared training and evaluation, thus saving even more computation time.
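As a sketch, the two sibling heads can be written as a tiny PyTorch module; the feature dimension and class count here are illustrative:

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Fast R-CNN's two sibling output heads: softmax classification over
    num_classes object classes plus background, and per-class box offsets.
    They are trained jointly with a multi-task loss (cross-entropy for the
    classifier plus a smooth L1 loss on the offsets of the true class)."""
    def __init__(self, in_features=4096, num_classes=20):
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes + 1)
        self.bbox_pred = nn.Linear(in_features, num_classes * 4)

    def forward(self, roi_feats):  # roi_feats: (num_rois, in_features)
        return self.cls_score(roi_feats), self.bbox_pred(roi_feats)
```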
Faster R-CNN (this paper, by Shaoqing Ren et al. in 2015) improves even further on the speed of Fast R-CNN, this time by tackling its primary bottleneck, the rather slow Selective Search algorithm, and replacing it with the novel Region Proposal Network (RPN): a small neural network that slides over the shared CNN feature map and, at every position, predicts objectness scores and box offsets for a set of reference boxes (‘anchors’) of several scales and aspect ratios. This change allows the region proposal procedure to be integrated into the overall neural network architecture, enabling even faster training and evaluation.
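A minimal sketch of such an RPN head in PyTorch, assuming a 512-channel feature map and the paper's 9 anchors (3 scales × 3 aspect ratios) per position:

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3x3 conv over the shared feature map,
    then two 1x1 convs predicting, for each anchor at each position,
    an objectness score and four box-offset values."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.objectness(x), self.bbox_deltas(x)
```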
The chart below compares the evaluation speeds of these three R-CNN variants.
Nevertheless, improvements do not end here. Faster R-CNN still suffers from some drawbacks: the introduction of the RPN turns the overall training procedure into a multi-task training problem, which is notably trickier to deal with, and Faster R-CNNs are still simply not fast enough to serve as real-time object detectors.
Mask R-CNN, introduced in this paper in 2017, builds on Faster R-CNN, extending it with a parallel branch that predicts a segmentation mask for each detected object (along with a more precise RoIAlign layer), at only a small additional computational cost. Alternatively, for real-time purposes, the YOLO (You Only Look Once) family of models is much more suitable, as these models are much faster than the R-CNN family, with the tradeoff being slightly lower mAP scores.
Nowadays, the R-CNN variants introduced in this article are largely obsolete or outdated. Nevertheless, appreciating the fundamental ideas and concepts they introduced can still greatly help us understand more modern advancements in object detection, as these ideas have undoubtedly paved the way for many of the latest state-of-the-art, accuracy-focused detection models.