20th February 2021 • Joel Tan
I’m sure most of us have pondered on the age-old question: Does size matter?
Obviously, as fellow AI enthusiasts, you’d immediately recognize that I was referring to whether the size of a neural network mattered to its performance, and the simple answer is: Yes, it certainly does.
Assuming our data and architectural implementation are kept constant, and with sufficient regularization, it is almost always the case that wider, deeper networks attain stronger performance. This makes perfect sense: a bigger network contains more learnable parameters, which allows it to better capture the deeper intricacies and patterns within the data.
In practice, however, the answer to our original question is far more nuanced, as raw performance isn’t the sole indicator of how ‘good’ a model is. Depending on the use-case, training time, inference time, and memory footprint, all of which are sacrificed in our pursuit of larger networks, may be far more significant factors in model selection. This matters because practical model development is frequently constrained by available computational resources and user-latency requirements, while also demanding continuous and rapid iteration.
So why is it, then, that most major deep learning architectural breakthroughs of recent times involve huge models with millions, if not billions, of parameters? One likely reason is the prominence of leaderboards for industry-standard datasets, which researchers often use to evaluate and compare models. An example for image classification is the ImageNet leaderboard shown in Figure 1. New architectures that compete with, or even exceed, the current top-performing models are often viewed as breakthroughs at the forefront of research, and are aptly recognized as state-of-the-art.
Figure 1: ImageNet Image Classification Leaderboard, showcasing the state-of-the-art models
As one would expect, the most straightforward strategy for attaining higher performance is simply to scale up network architectures by making them wider and/or deeper with more layers. This has led to an unfortunate reality: to compete with the best, one usually has no choice but to use very large networks.
As of today, the best performing models on the ImageNet benchmark contain over 500 million parameters. For comparison, Inception-V1, a popular CNN model in 2014, only had around 5 million parameters. For the majority of us who don’t have access to massive GPU clusters, these gigantic behemoths of models are completely infeasible to train from scratch.
As such, transfer learning is widely used to leverage the feature representations learnt by these networks, which are then fine-tuned to our specific problem with limited data and time. Although this greatly mitigates our computational limitations, the fine-tuning procedure can still take a decent chunk of time, and memory constraints and longer inference times remain a challenge for rapid-inference use-cases and low-performance devices. Furthermore, transfer learning may not be applicable for niche datasets or tasks that differ completely from the standard deep learning benchmarks.
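To make the fine-tuning workflow concrete, here is a minimal PyTorch sketch of one common recipe: load an ImageNet-pretrained backbone, freeze its features, and train only a new classification head. The choice of ResNet-18 and the 10-class head are purely illustrative assumptions, not something prescribed by any of the papers discussed here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (other torchvision models work similarly).
model = models.resnet18(pretrained=True)

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer with one sized for our own task,
# e.g. a hypothetical 10-class dataset.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Unfreezing some of the later backbone layers (at a lower learning rate) is a common variation when more data is available.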
Thus, a highly important avenue of research is the development of more efficient methods for building and training deep learning models. Here, efficiency can be roughly characterized by a model’s training time, inference time, and size, though the relative importance of each may differ significantly depending on the model’s purpose.
I will now briefly go through several advancements in model efficiency and the development of low-latency, mobile-sized image classification CNNs over the last 4 years. We begin with one of the most popular mobile-sized CNNs in recent times, often used as a benchmark for comparison, perhaps due to its fitting name: MobileNets.
MobileNetV1 was introduced by Google AI in 2017. Their motivation was simple: develop a CNN small enough to run with low latency on mobile devices, while still producing performance comparable to that of other giant models. MobileNetV1 replaced the standard, vanilla convolutional layer with the simple yet efficient depthwise-separable convolution block, which itself consists of 2 layers: a depthwise convolution layer followed by a pointwise (i.e. 1x1) convolution layer.
Figure 2: Depthwise-separable Convolution
The depthwise-separable convolution block can be thought of as a factorized form of a standard convolution. A standard convolutional layer both filters and combines inputs into a new set of outputs in a single step. In comparison, the depthwise-separable convolution block splits this task across its 2 layers: the depthwise convolution applies a single filter to each input channel, and the 1x1 convolution combines the outputs of the depthwise convolution. Notably, a 3x3 depthwise-separable convolution block uses around 9 times fewer computations than a standard 3x3 convolution, at only a minor cost in accuracy. A detailed explanation of the depthwise-separable convolution is given in this article. Furthermore, MobileNetV1 also introduced two intuitive global hyperparameters, a width multiplier and a resolution multiplier, for scaling the network up or down. Depending on resource constraints, practitioners can tweak these to obtain the MobileNet that best suits their available resources.
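To make the factorization concrete, below is a minimal PyTorch sketch of a depthwise-separable convolution block in the spirit of MobileNetV1 (BatchNorm and ReLU after each of the two layers); the exact hyperparameters are illustrative rather than copied from the paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one 3x3 filter per input channel)
    followed by a pointwise (1x1) convolution that mixes channels."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels makes the 3x3 convolution act per-channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))   # filter each channel separately
        x = self.relu(self.bn2(self.pointwise(x)))   # combine channels with 1x1 conv
        return x
```

For a 3x3 kernel and N output channels, the multiply-add count drops by roughly a factor of 1/N + 1/9 relative to a standard convolution, which is where the “around 9 times fewer computations” figure comes from.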
A few months later, Megvii Inc. developed the ShuffleNet architecture. ShuffleNet introduces an efficient channel shuffle operation across group convolution layers, which allows information to flow between channel groups, resulting in richer feature representations for only a minor increase in computation (a minimal sketch of the shuffle operation follows Figure 3). ShuffleNet’s base architecture is otherwise similar to the depthwise-separable convolutions of MobileNetV1. Ultimately, ShuffleNet was able to achieve higher accuracy than MobileNetV1 at a comparable computational cost.
Figure 3: a) without channel shuffling, b) input and output channels are fully related, c) equivalent implementation to b) using the channel shuffle operation introduced in the paper
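The channel shuffle itself is just a reshape-transpose-reshape over the channel dimension. The sketch below is a minimal PyTorch illustration of the operation, not the authors’ implementation.

```python
import torch

def channel_shuffle(x, groups):
    """Shuffle channels so that subsequent group convolutions
    see channels from every group (ShuffleNet-style)."""
    batch, channels, height, width = x.size()
    channels_per_group = channels // groups

    # Reshape to (batch, groups, channels_per_group, H, W),
    # swap the group and channel axes, then flatten back.
    x = x.view(batch, groups, channels_per_group, height, width)
    x = x.transpose(1, 2).contiguous()
    return x.view(batch, channels, height, width)

# Usage: x = channel_shuffle(x, groups=3) between two group convolutions.
```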
The authors of MobileNetV1 then introduced MobileNetV2 in 2018, which improves on its predecessor primarily through the introduction of the inverted residual block (also commonly referred to as MBConv), built around two novel ideas: inverted residuals and the linear bottleneck.
While the standard residual block introduced in the famous ResNet architecture follows a ‘wide -> narrow -> wide’ structure, the inverted residual block follows a ‘narrow -> wide -> narrow’ structure instead, which greatly reduces the number of parameters within the block, at the cost of some performance. The linear bottleneck, meanwhile, refers to the removal of the ReLU activation from the final convolutional layer of the block, which the authors argue is necessary to avoid destroying information in the narrow, low-dimensional bottleneck. More information is given in this article, and a simplified sketch of the block follows Figure 4.
Figure 4: Residual vs Inverted Residual Block, thickness indicates number of channels
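Below is a stripped-down PyTorch sketch of an MBConv-style block. The expansion factor of 6 matches the paper’s default, but the remaining details are simplified for illustration rather than a faithful reproduction of MobileNetV2.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Narrow -> wide -> narrow block with a linear (activation-free)
    final 1x1 projection, in the spirit of MobileNetV2."""

    def __init__(self, in_channels, out_channels, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_channels * expand_ratio
        self.use_residual = (stride == 1 and in_channels == out_channels)

        self.block = nn.Sequential(
            # 1x1 expansion: narrow -> wide
            nn.Conv2d(in_channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution in the widened space
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear bottleneck: wide -> narrow, note the absent activation
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```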
Separately, Barret Zoph and Quoc V. Le introduced Neural Architecture Search (NAS) with Reinforcement Learning in 2017, a remarkable method in which the model ‘learns’ its own architecture. The authors were motivated to develop a strategy for searching for an optimal, efficient architectural building block on a smaller dataset, which could then be scaled up and transferred to other, larger datasets.
In essence, NAS can be thought of as a smart grid-search, and works as follows: rather than defining the entire network architecture by hand, as is usually the case, the practitioner instead defines an architecture search space, which the model searches through itself using reinforcement learning. This greatly reduces the burden of extensive manual architectural engineering, though it does suffer from computational limitations, as even a moderately large search space leads to extremely long search times.
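To give a flavour of the idea (and only the idea), the toy sketch below replaces the RL controller with plain random sampling over a small, hypothetical search space. Real NAS trains a controller to propose increasingly better architectures, but the outer loop of ‘sample, train, evaluate, keep the best’ looks similar.

```python
import random

# A toy, purely illustrative search space over block hyperparameters.
SEARCH_SPACE = {
    "kernel_size": [3, 5, 7],
    "num_filters": [16, 32, 64],
    "num_layers": [2, 4, 6],
}

def sample_architecture():
    """Sample one candidate architecture from the search space."""
    return {name: random.choice(options) for name, options in SEARCH_SPACE.items()}

def search(num_trials, build_and_evaluate):
    """Random-search stand-in for the RL controller: sample candidates,
    train/evaluate each one, and keep the best-scoring architecture."""
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture()
        # build_and_evaluate would train the candidate briefly and
        # return e.g. its validation accuracy.
        score = build_and_evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```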
Since then, NAS has been improved in several different ways, and has led to the development of several noteworthy architectures such as NASNet and MnasNet, of which the latter is a mobile-sized network. Other efficiency-focused advancements have used NAS to find a strong, efficient base network for scaling up, one of which is the now widely-used EfficientNet architecture, introduced in 2019.
EfficientNets are covered in more detail in a previous society article. The main takeaway from the EfficientNet paper is the introduction of an intuitive, principled compound scaling method for efficiently scaling CNNs up to larger sizes. In fact, many other researchers have paired EfficientNet architectures with their own novel methods to achieve current state-of-the-art image classification performance, as can be seen in Figure 1.
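As a quick illustration of compound scaling: the paper scales depth, width, and input resolution by α^φ, β^φ, and γ^φ respectively, with the constants chosen so that α·β²·γ² ≈ 2 (the paper reports α = 1.2, β = 1.1, γ = 1.15). The small script below simply prints how the three dimensions grow together as the compound coefficient φ increases; the published B1–B7 variants round these numbers somewhat differently in practice.

```python
# Compound scaling rule from the EfficientNet paper: depth, width, and input
# resolution are all scaled by a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # grid-searched constants from the paper

def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    # Total FLOPs grow roughly by (ALPHA * BETA**2 * GAMMA**2)**phi ~ 2**phi.
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```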
Lastly, I would like to highlight a very recent advancement, the Normalization-Free Networks (NFNet), introduced by Andrew Brock et al. this month (February 2021). As a preface, batch-normalization (batchnorm) is a very popular technique that allows very deep networks to be trained much faster, but it comes with some significant drawbacks. Most importantly, batchnorm introduces dependencies between training examples within a mini-batch, which causes various practical issues, such as difficulty replicating results across different hardware and subtle, hard-to-detect implementation errors.
As the name suggests, the main idea of the paper is to remove batchnorm from the network architecture while still being able to train very deep networks. In place of batchnorm, the paper introduces an adaptive gradient clipping (AGC) method to help mitigate the issues that arise without it, such as a poorly-conditioned loss landscape. Ultimately, the NFNet models were able to surpass the performance of comparable EfficientNet models while being up to 8.7x faster to train.
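The sketch below illustrates the idea behind adaptive gradient clipping: each gradient is clipped based on the ratio of its norm to the norm of the parameter it belongs to, rather than by a fixed global threshold. Note that this version clips per parameter tensor for brevity, whereas the paper applies the clipping unit-wise (e.g. per output channel), and the clip_factor value is illustrative, so treat it as a conceptual sketch rather than a faithful reimplementation.

```python
import torch

def adaptive_gradient_clip(parameters, clip_factor=0.01, eps=1e-3):
    """Clip each gradient when its norm is large relative to the norm of the
    parameter it belongs to (simplified, per-tensor version of AGC)."""
    with torch.no_grad():
        for p in parameters:
            if p.grad is None:
                continue
            param_norm = p.detach().norm().clamp_min(eps)  # avoid zero-norm params
            grad_norm = p.grad.detach().norm()
            max_norm = clip_factor * param_norm
            if grad_norm > max_norm:
                # Rescale the gradient down to the allowed norm.
                p.grad.mul_(max_norm / (grad_norm + 1e-6))

# Usage: call between loss.backward() and optimizer.step(), e.g.
# adaptive_gradient_clip(model.parameters(), clip_factor=0.01)
```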
To summarize, I have showcased several impactful advancements in model efficiency, specifically for CNNs. These methods not only help practitioners like you and me train smaller models, but are even more useful in the development of larger models, where even small speedups lead to notable reductions in training time. Finally, my opinionated advice to fellow practitioners is as follows: while scaling up our neural networks may be an easy way to improve performance (with sufficient regularization, of course), we have to be mindful of practical constraints. It is often the case that the marginal improvement from a larger model isn’t worth the expensive computational resources and time needed to train it, which could have been better spent developing multiple smaller models at a much more rapid pace.
Ultimately, at least in the world of practical AI, bigger doesn’t always equal better.