4th September 2020 • Tan Xue Ying and Joel Tan
This is a summary of the research paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks by Mingxing Tan and Quoc V. Le. Readers are assumed to have a basic knowledge of neural networks.
To grasp the key ideas of this paper, it is crucial to first understand what Convolutional Neural Networks (CNNs) are. Typically used for image and/or video data, CNNs are specialized neural networks with many applications in our data-driven world, e.g. image classification, face recognition and object detection. In brief, they consist of several sequential stages, where each stage is composed of layers of the same convolution type performing similar tasks. Each layer consists of several filters; the more filters a layer has, the more ‘complex’ it is.
The complexity of a CNN as a whole can be roughly estimated by the number of parameters in the model, or by the total number of floating point operations (FLOPs) it performs. Higher complexity generally leads to longer training times. For a neural network to capture more complex relationships and thus attain higher accuracy, we need to scale up the CNN, i.e. increase its complexity. In practice, we typically develop CNNs on a fixed resource budget (e.g. training time, or FLOPs), scaling them up only when more resources become available.
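To make the stage/layer/filter terminology and the parameter-count measure concrete, here is a minimal toy CNN sketched in PyTorch (our own illustration, not an architecture from the paper):

```python
import torch.nn as nn

# A minimal, illustrative CNN: two "stages", each made of convolutional
# layers of the same type. The out_channels argument of each Conv2d is
# the number of filters in that layer.
tiny_cnn = nn.Sequential(
    # Stage 1: two 3x3 conv layers with 16 filters each
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # Stage 2: two 3x3 conv layers with 32 filters each
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),  # e.g. 10 output classes
)

# One rough measure of model complexity: the total parameter count.
num_params = sum(p.numel() for p in tiny_cnn.parameters())
print(f"Parameters: {num_params:,}")
```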
There are three typical methods to scale up: by depth (i.e. increasing the total number of layers in the CNN), by width (i.e. increasing the number of filters within each layer), and by image resolution (i.e. increasing the resolution of input images).
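As a rough illustration (again our own toy example, with made-up baseline numbers), the three methods can be thought of as three independent knobs on a baseline configuration:

```python
# Three independent scaling knobs on a hypothetical baseline configuration.
baseline = {
    "num_layers": 18,         # depth
    "filters_per_layer": 64,  # width
    "image_resolution": 224,  # input resolution
}

def scale_depth(cfg, factor):
    return {**cfg, "num_layers": round(cfg["num_layers"] * factor)}

def scale_width(cfg, factor):
    return {**cfg, "filters_per_layer": round(cfg["filters_per_layer"] * factor)}

def scale_resolution(cfg, factor):
    return {**cfg, "image_resolution": round(cfg["image_resolution"] * factor)}

print(scale_depth(baseline, 2.0))       # twice as many layers
print(scale_width(baseline, 1.5))       # 1.5x filters per layer
print(scale_resolution(baseline, 1.3))  # larger input images
```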
In recent years, CNNs have been getting more accurate by growing extremely large, which makes training times very long, sometimes to the point of being infeasible. Neural Architecture Search (NAS), though useful for designing efficient mobile-sized (i.e. small) neural networks, is far too slow for their larger counterparts. Improving the accuracy and efficiency of large neural networks is therefore an important quest, and it is the primary problem this paper seeks to address.
Model Scaling
The paper brings forth two main achievements. Firstly, a compound scaling method is introduced. Instead of scaling by each of the three methods individually, or combining them in an arbitrary manner, the authors developed a principled and simple way to scale up neural networks using all three methods together efficiently.
The idea is simple: we first determine a set of scaling constants, one for each of the three scaling methods (α for depth, β for width, γ for resolution), using a small grid search. These constants tell us how much weightage each scaling method should have. Keeping them fixed, we then choose a single compound coefficient φ that controls how far to scale the whole network: depth grows by α^φ, width by β^φ and resolution by γ^φ, until the scaled CNN meets our new target FLOPs.
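A minimal sketch of that arithmetic is shown below; α = 1.2, β = 1.1 and γ = 1.15 are the values the paper reports for EfficientNet, chosen under the constraint α·β²·γ² ≈ 2 so that FLOPs grow roughly by 2^φ:

```python
# Compound scaling: one coefficient phi scales depth, width and resolution
# together. ALPHA, BETA, GAMMA come from a small grid search with the
# constraint ALPHA * BETA**2 * GAMMA**2 ~= 2, so FLOPs grow roughly 2**phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi):
    depth_mult = ALPHA ** phi
    width_mult = BETA ** phi
    resolution_mult = GAMMA ** phi
    flops_mult = (ALPHA * BETA**2 * GAMMA**2) ** phi  # approximately 2**phi
    return depth_mult, width_mult, resolution_mult, flops_mult

for phi in range(4):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, ~{f:.1f}x FLOPs")
```

With φ = 0 we recover the baseline; larger values of φ produce the progressively bigger members of a model family.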
The compound scaling method was tested with existing, widely used architectures such as ResNet-50 and MobileNetV1/V2 as baselines: comparing models scaled by the different methods while keeping FLOPs comparable, compound scaling consistently produced the model with the highest top-1 accuracy on ImageNet.
The second achievement builds on the foundation of the first: the development of the widely known EfficientNet architecture. A novel mobile-sized baseline network, EffNet-B0, was developed using NAS with an optimization goal that trades off maximizing accuracy against minimizing FLOPs. The other members of the EffNet family, EffNet-B1, EffNet-B2, …, up to EffNet-B7 (the largest introduced in the paper), were obtained by applying the aforementioned compound scaling method to the baseline EffNet-B0.
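The search objective described in the paper rewards accuracy while penalizing models whose FLOPs exceed a target T, via ACC(m) × (FLOPS(m)/T)^w with w = -0.07. A small sketch of that trade-off (the candidate accuracies and FLOPs below are made up for illustration; the 400M FLOPs target is the one mentioned in the paper):

```python
# Multi-objective reward used during the architecture search:
# higher accuracy is good, exceeding the FLOPs target T is penalized.
# w = -0.07 is the exponent reported in the paper.
def search_reward(accuracy, flops, target_flops, w=-0.07):
    return accuracy * (flops / target_flops) ** w

# Hypothetical candidates: (top-1 accuracy, FLOPs); target T = 400M FLOPs.
candidates = [(0.76, 380e6), (0.77, 700e6), (0.74, 250e6)]
for acc, flops in candidates:
    print(f"acc={acc:.2f}, FLOPs={flops/1e6:.0f}M, "
          f"reward={search_reward(acc, flops, 400e6):.4f}")
```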
[Figure: The process of developing EffNet-B0]
The primary building block of EffNets is the mobile inverted bottleneck convolution block (MBConv). EffNet architectures are relatively simple: every EffNet consists of 9 sequential stages, each with a varying number of layers, namely an ‘input’ stage, followed by 7 MBConv stages, followed by an ‘output’ stage. What makes EffNet models stand out is their efficiency: compared with other models of similar performance, EffNets have fewer parameters and thus require much less training time to reach similar accuracies. Notably, at the time of the paper, EffNet-B7 achieved the highest top-1 accuracy on ImageNet while using 8.4x fewer parameters than GPipe, the previous best.
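For a concrete picture of that 9-stage layout, here is a rough sketch of the B0 configuration following Table 1 of the paper (treat it as an indicative summary rather than a faithful reimplementation):

```python
# EfficientNet-B0 stage layout: (operator, input resolution, output channels,
# number of layers), following Table 1 of the paper.
# Stage 1 is the 'input' stem and stage 9 is the 'output' head.
EFFNET_B0_STAGES = [
    ("Conv3x3",                224,   32, 1),  # input stem
    ("MBConv1, k3x3",          112,   16, 1),
    ("MBConv6, k3x3",          112,   24, 2),
    ("MBConv6, k5x5",           56,   40, 2),
    ("MBConv6, k3x3",           28,   80, 3),
    ("MBConv6, k5x5",           14,  112, 3),
    ("MBConv6, k5x5",           14,  192, 4),
    ("MBConv6, k3x3",            7,  320, 1),
    ("Conv1x1 & Pooling & FC",   7, 1280, 1),  # output head
]

total_layers = sum(layers for _, _, _, layers in EFFNET_B0_STAGES)
print(f"{len(EFFNET_B0_STAGES)} stages, {total_layers} layers in total")
```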
[Figure: Graph showcasing the improvements gained from the paper]
As of today, EffNets still deliver state-of-the-art performance and are being adopted across the industry. They have also been used alongside other novel methods introduced in the past year, such as Noisy Student by the Google Brain team and FixRes by Facebook AI Research, to achieve some of the highest top-1 accuracies on ImageNet. It is safe to say that EfficientNets have become a mainstay in the world of CNNs, and they are well worth your attention.
Sources and links:
Article source - arXiv
Paper sharing slides for the article - Google Slides