Deep neural networks have enormous potential. Unfortunately, they’re also notoriously difficult to train properly. Arguably more than any other machine-learning model, a deep neural network is prone to overfitting.
This is largely a matter of complexity. The more layers a deep neural network has, the more sensitive it becomes to its initial configuration and the random weights it starts with. Consequently, deep neural networks are both more vulnerable to and more heavily influenced by bad training data.
In light of this, it should come as little surprise that much of the work in the field of deep learning involves exploring techniques for overcoming overfitting. Recall that overfitting occurs when a model becomes over-specialized on its training data, rendering it incapable of generalizing to new information. Given the complexity and computational cost of the average deep neural network, this is a pitfall to be avoided at all costs.
That’s where batch normalization comes in. Though it was designed primarily to make training faster and more stable, it also has a regularizing side effect that can help guard against overfitting – and these days, it’s far from the only technique that does. It also comes with some considerable advantages that every machine-learning engineer should be aware of.
What is Batch Normalization?
Batch normalization is a method intended to promote faster, more stable training for deep neural networks. Commonly known as batchnorm, this technique re-centers and re-scales the inputs to the intermediate layers of a neural network, also known as the hidden layers. It was originally proposed in 2015 as a means of addressing internal covariate shift – a phenomenon in which the distribution of each layer’s inputs changes as the network updates its weights during training.
Given that deep neural networks generally train on colossal data sets, normalizing data during the pre-processing stage is highly impractical. This is where the ‘batch’ part of batch normalization comes into play. Rather than modifying the entire data set, the batch normalization algorithm is applied to each individual batch of training data.
Interestingly, though it’s generally accepted that batch normalization addresses covariate shift and improves both speed and accuracy in deep learning, there’s some contention among machine-learning engineers over precisely how and why it does so.
How does Batch Normalization Work?
Batch normalization typically follows a three-step process. This process is repeated with each new batch of data.
First, the batchnorm layer computes the mean and the variance – the square of the standard deviation – of the current batch. It then uses those statistics to normalize the batch’s activation vector, subtracting the mean from each output and dividing by the standard deviation so that the outputs have zero mean and unit variance.
Finally, the layer applies a pair of trainable parameters, 𝛾 and 𝛽, to the normalized outputs. The former re-scales the distribution, letting the network learn a standard deviation other than 1, while the latter shifts the curve right or left, letting it learn a mean other than 0. This is the point at which re-scaling and offsetting take place, and it ensures the network doesn’t lose expressive power to the normalization itself.
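To make the three steps concrete, here’s a minimal NumPy sketch of a batchnorm forward pass. The function name, shapes, and epsilon value are illustrative choices, not taken from any particular library:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x:     (batch_size, num_features) batch of activations
    # gamma: (num_features,) trainable re-scaling parameter
    # beta:  (num_features,) trainable shifting parameter

    # Step 1: compute per-feature mean and variance over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)

    # Step 2: normalize to zero mean and unit variance
    # (eps guards against division by zero for near-constant features)
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Step 3: re-scale and shift with the trainable parameters
    return gamma * x_hat + beta

# A quick sanity check with skewed, widely spread inputs
batch = np.random.randn(32, 4) * 10 + 5
out = batchnorm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```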
Batch normalization functions somewhat differently depending on whether the model is training or being evaluated. During training, the algorithm has a full batch to work with – all the data it requires to accurately normalize its activation vector and adjust both deviation and bias. During testing, there’s no guarantee the model will receive a full batch.
As such, instead of calculating the mean and variance directly, a batch normalization layer falls back on running estimates of the mean and standard deviation accumulated from the batches it saw during training.
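A common way to maintain those estimates is an exponential moving average over each training batch’s statistics. The sketch below illustrates the idea; the class name, default momentum, and update rule are illustrative assumptions, though they mirror what popular frameworks do:

```python
import numpy as np

class RunningBatchStats:
    # Tracks running mean/variance during training so the layer can
    # normalize inputs at test time without a full batch.

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def update(self, batch):
        # Training phase: blend each batch's statistics into the
        # running estimates via an exponential moving average.
        m, v = batch.mean(axis=0), batch.var(axis=0)
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * m
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * v

    def normalize(self, x):
        # Evaluation phase: normalize with the accumulated estimates,
        # even for a single sample.
        return (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
```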
Lastly, while the most common approach targets a mean of 0 and a standard deviation of 1, you may opt for a different normalization strategy depending on the nature of your data set and the type of model you’re training.
Advantages of Batch Normalization
The benefits of batch normalization – and more broadly, of normalization in general – include, but are not limited to:
- Batch-normalized models are able to achieve a considerably higher learning rate than non-normalized models, with no apparent ill effects.
- As mentioned earlier, batch normalization helps to eliminate internal covariate shift, a phenomenon that considerably impacts the accuracy and efficacy of neural networks.
- Batch normalization has also been found to have a positive impact on convergence, allowing deep neural networks to reach an acceptable level of error in fewer training steps.
- Batch normalization eliminates the need for a bias term in the layer that precedes it, since the 𝛽 parameter performs the same shift – a small but welcome simplification for whatever model it’s applied to (see the sketch after this list).
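As an example, here’s how that last simplification might look in PyTorch; the layer sizes are arbitrary:

```python
import torch.nn as nn

# Because batchnorm's beta parameter already shifts the output,
# the preceding layer's bias term is redundant and can be dropped.
block = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias disabled; beta takes its place
    nn.BatchNorm1d(64),
    nn.ReLU(),
)
```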
Note that batch normalization need not be applied exclusively in deep learning, either. While it was developed with neural networks in mind, the underlying idea of normalizing inputs carries over to just about any machine-learning model. You might even consider using batch normalization alongside other techniques like data augmentation or synthesized data.
Benefits aside, there’s another reason you’ll want to apply normalization in deep learning. Without batch normalization, smaller values and data points tend to get lost in a sea of noise, drowned out by larger values. This not only impacts the long-term accuracy of your neural network but also makes it likelier that the model will suffer from overfitting.
Even if you don’t end up with an overfitted neural network, a lack of normalization will greatly increase your training time, as the model will take longer to converge on good weights.
Batch Normalization vs. Layer Normalization
Batchnorm is far from perfect, and it’s not the only normalization technique you should consider.
For one, batch normalization algorithms don’t work particularly well with small or varied batch sizes. They also struggle with distributed training, particularly when it takes place across different systems. Finally, depending on the type of deep learning model your engineers use, batch normalization can introduce unnecessary complexity into the training process.
For instance, while you’ll likely experience a great deal of success applying batch normalization in a convolutional neural network, it’s far less suited to recurrent neural networks (RNNs). Because an RNN applies the same weights at every time step of a sequence, it effectively needs a separate set of batch statistics – in practice, a separate batch normalization layer – for each time step, even within a single batch of data.
Layer normalization may be a better choice for such models. Instead of computing statistics across the batch dimension, it normalizes across the features of each individual instance, so it behaves identically regardless of batch size. This makes it well-suited for both recurrent and sequential machine-learning models.
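The difference comes down to which axis the statistics are computed over. Here’s an illustrative NumPy sketch, with names of our own choosing:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Layer norm: mean/variance are computed per sample, across its
    # features (axis=-1), so the result is independent of batch size.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Works just as well on a "batch" of one sample with 8 features
single = np.random.randn(1, 8)
print(layernorm(single, np.ones(8), np.zeros(8)).std())  # ~1
```

Contrast this with the earlier batchnorm sketch, which computes its statistics along axis=0 – the batch dimension.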
Layer normalization is technically an extreme case of group normalization – a method that applies normalization across groups of channels instead of batches – in which all channels belong to a single group. It sits at the opposite end of the spectrum from instance normalization, in which each channel is its own separate group.
Layer normalization also operates the same regardless of whether a model is being trained or evaluated – it doesn’t require separate functions or variables for each phase.
Other normalization techniques include:
- Weight normalization, which targets the weights of a layer rather than its activations. This technique can be combined with mean-only batch normalization to great effect.
- Batch-instance normalization, which attempts to address the weaknesses of instance normalization by combining it with batch normalization.
- Switchable normalization, which combines characteristics of batch, instance, and layer normalization.
How to Apply Batch Normalization
The good news is that the major deep learning frameworks already ship with pre-built normalization layers – batch normalization among them – that engineers can easily add to their neural networks. With that said, there are a few best practices you should bear in mind when deciding which normalization techniques to apply:
- Be mindful of what type of neural network you’re attempting to train. While batch normalization works well with network types like multilayer perceptrons and convolutional neural networks, it’s less suited for more complex models like recurrent neural networks.
- While in most cases you’ll want to apply batch normalization prior to the activation function (see the sketch after this list), there are some edge cases where you’ll get better results applying it post-activation, most notably with non-Gaussian distributions.
- Bear in mind that batch-normalized models often train best with larger-than-normal learning rates.
- Exercise caution when attempting to combine batch normalization with other normalization or regularization techniques, such as dropout.
- While batch normalization is typically applied to select individual layers, in some cases you might apply it after every hidden layer in the network. Note that this still isn’t layer normalization – the two differ in which dimension they normalize over.
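As a concrete example of the pre-activation placement mentioned above, here’s a small convolutional block built from PyTorch’s built-in layers; the channel counts and kernel sizes are arbitrary:

```python
import torch.nn as nn

# Batchnorm placed before each activation function - the most
# common arrangement in convolutional networks.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes the pre-activation outputs
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

# model.train() uses per-batch statistics; model.eval() switches to
# the running estimates accumulated during training.
```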
Ultimately, batch normalization is only one of the many ways machine-learning engineers can address the challenges of training a deep neural network. While it’s arguably a technique you should familiarize yourself with, be careful not to over-rely on it.