Neural networks have seen dramatic growth in popularity over the past few years, with applications ranging from computer vision to natural language processing (NLP). A crucial component of most neural network architectures is the feedforward layer, also called a fully-connected layer. Each neuron in such a layer multiplies every input by a learned weight, sums the results, adds a learned bias, and passes that sum through an activation function to produce its output.
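To make the fully-connected layer concrete, here is a minimal sketch in plain Python. The function names (`dense_forward`, `relu`) and the example numbers are made up for illustration; real frameworks do the same arithmetic with optimized matrix operations.

```python
def relu(z):
    """A common activation function: negative values become 0."""
    return max(0.0, z)


def dense_forward(x, weights, biases, activation):
    """Forward pass of one fully-connected layer.

    x       : list of n_in input values
    weights : n_out rows, each a list of n_in learned weights
    biases  : list of n_out learned biases
    """
    outputs = []
    for w_row, b in zip(weights, biases):
        # Weighted sum over ALL inputs, plus this neuron's bias.
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        outputs.append(activation(z))
    return outputs


# Three inputs feeding two neurons (illustrative values only).
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.25],
     [1.0, 1.0, 1.0]]
b = [2.0, -1.0]
print(dense_forward(x, W, b, relu))  # [1.25, 5.0]
```

Every output neuron sees every input, which is what "fully connected" means.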
The number of learned weights in these layers can be large: a layer with n inputs and m output neurons has n × m weights plus m biases. Because layers differ so much in cost, it makes sense to ask which parts of the architecture matter most. If you don’t pay attention to this detail, your model may work well but not very efficiently.
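As a rough illustration of how quickly these layers grow, the sketch below counts the learned parameters of a fully-connected layer from its input and output sizes. The helper name `dense_param_count` and the layer sizes are assumptions chosen for the example.

```python
def dense_param_count(n_in, n_out, use_bias=True):
    """Learned parameters in one fully-connected layer:
    one weight per (input, output) pair, plus one bias per output."""
    return n_in * n_out + (n_out if use_bias else 0)


# A hypothetical two-layer classifier: 784 inputs -> 128 hidden -> 10 outputs.
hidden = dense_param_count(784, 128)
output = dense_param_count(128, 10)
print(hidden, output, hidden + output)  # 100480 1290 101770
```

Almost all of the parameters sit in the first layer, simply because it has the most inputs.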
In this article we will discuss one such important part of deep learning models which has become increasingly common: batch normalization. Batch normalization was proposed in 2015 by Sergey Ioffe and Christian Szegedy at Google, and since then it has become one of the most widely used ways to speed up and stabilize training across a wide range of tasks. The batch statistics are computed only during training; at test time the layer normalizes with running averages of the mean and variance collected during training, so a prediction does not depend on which other examples share its batch.
Batch normalization was introduced to reduce internal covariate shift. Internal covariate shift is the change in the distribution of each layer’s inputs during training, which happens because the weights of the preceding layers keep changing.
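The following is a minimal sketch of batch normalization for a single feature, written in plain Python. The function names, the `running` dictionary, and the momentum convention (new = (1 - momentum) * old + momentum * batch value) are assumptions made for this example; libraries differ in these details.

```python
def batch_norm_train(batch, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Training mode: normalize one feature with the statistics of this batch
    and update the running averages that will be used at test time."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n  # biased (population) variance
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    # gamma and beta are the learned scale and shift.
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in batch]


def batch_norm_infer(value, gamma, beta, running, eps=1e-5):
    """Inference mode: use the stored running statistics, not the batch."""
    return gamma * (value - running["mean"]) / (running["var"] + eps) ** 0.5 + beta


running = {"mean": 0.0, "var": 1.0}
normalized = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0, running=running)
print(normalized)                                   # roughly zero mean, unit variance
print(batch_norm_infer(2.5, 1.0, 0.0, running))     # uses running["mean"] and running["var"]
```

Because inference uses stored statistics, a single test example gets the same output no matter what else is in its batch.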
One of the most important concepts in deep learning is parameter importance. A very common way to determine how much each individual layer or component contributes to the final result is to evaluate how changing or removing that component would affect the performance of the network.
Experiments of this kind, usually called ablation studies, are standard in computer vision research. Take dropout layers, which randomly set some activations to 0 during training so that those neurons temporarily do not function as part of the neural net. Dropout is switched off at inference time and has no learned weights, so removing it changes how training behaves rather than the size of the deployed model. In their 2012 ImageNet work at the University of Toronto, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton reported that their network overfit substantially without dropout.
The general rule is that a component which contributes little to the results can be removed without too large an impact, and an ablation study is how you find out which components those are. This has become a routine part of developing new architectures for CNNs and other types of NNs such as RNNs (recurrent neural nets).
Removing a layer that holds weights saves computational time, because its forward and backward passes no longer run on GPU resources that are often already saturated. It also reduces memory requirements, since there are fewer weights to store.
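For readers who have not seen one, this is a minimal sketch of a dropout layer. It uses the "inverted dropout" variant, an assumption for this example, in which the surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and arguments are illustrative.

```python
import random


def dropout(activations, p_drop, training, rng=random):
    """Randomly zero each activation with probability p_drop (training only).

    Survivors are divided by the keep probability so the expected value of
    each activation is unchanged. At inference the input passes through as-is.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, training=True, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, training=False))
```

Note that the layer has no weights of its own: it only masks the activations that flow through it.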
A popular way to evaluate the significance of individual layers in neural networks is layer-wise ablation. In this method, we try to determine the effect that each individual layer has on the final result. By calculating the difference between the performance with all layers and the performance without a specific layer, we can measure that layer’s contribution.
A well-known example comes from the ImageNet classification network built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. Taking their full model as the base, they removed different parts to see how much impact each had on accuracy.
They reported that removing any single convolutional layer made the network worse: taking out any of the middle layers cost about 2% in top-1 accuracy, even though each of those layers holds less than 1% of the model’s parameters. The fully connected layers near the output matter for a different reason: they merge the features extracted by the earlier layers into the final class scores.
The lesson is that a layer’s parameter count says little about its importance; you have to measure it. It also does not follow that deeper nets are always better! Capacity that the data cannot support tends to overfit or learn useless details, so you have to be careful where you add depth.
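The comparison described above (performance with all layers versus performance without one layer) can be sketched with a toy model. Everything here is hypothetical: the "layers" are plain functions on a single number, `evaluate` and `ablate` are illustrative names, and the model is not retrained after each removal, which a real study would normally do.

```python
def evaluate(layers, dataset):
    """Accuracy of a model given as a list of functions applied in order.
    The model predicts True when its final output is positive."""
    correct = 0
    for x, label in dataset:
        for layer in layers:
            x = layer(x)
        correct += int((x > 0) == label)
    return correct / len(dataset)


def ablate(layers, dataset):
    """Remove each layer in turn and record the drop in accuracy."""
    baseline = evaluate(layers, dataset)
    drops = {}
    for i, layer in enumerate(layers):
        without = layers[:i] + layers[i + 1:]
        drops[layer.__name__] = baseline - evaluate(without, dataset)
    return baseline, drops


def center(x):
    return x - 5.0


def scale(x):
    return 2.0 * x


# Toy task: is the number greater than 5?
dataset = [(float(v), v > 5) for v in (1, 2, 3, 4, 6, 7, 8, 9)]
baseline, drops = ablate([center, scale], dataset)
print(baseline)  # 1.0
print(drops)     # {'center': 0.5, 'scale': 0.0}
```

In this toy case `center` is essential and `scale` can be removed for free, which is exactly the kind of conclusion an ablation is meant to surface.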
The internal structure of your neural network is one of its most important features. Different architectures have been shown to achieve different levels of accuracy for the same task, with some being much more efficient than others.
The two outermost components of any deep learning network are the input layer and the output (or prediction) layer. The layers in between are referred to as hidden layers.
These hidden layers are what make it possible to combine multiple concepts into one model. By stacking several of them, we get very complex models that can learn increasingly complex patterns from our data.
In this article, we will discuss three common types of networks and how they work. You will also learn about some basic terms related to these networks.
Dense or sparse
In computer science, deep learning refers to algorithms that use neural networks with many layers to perform specific tasks. Neural networks are computational models loosely inspired by how neurons in our brains work.
In general, neural networks have layers of nodes (think of each node as a simplified neuron) connected to the nodes in neighboring layers. How the connections are laid out is fixed by the architecture you choose; the strength of each connection, its weight, is what the network learns from data.
For example, say you want to classify pictures as either dog photos or cat photos. This would be a very basic task for a neural network! You could have one node that responds to shape, another that responds to color, and so on. The quantities these individual nodes detect are called features.
The job of a feature layer is to look at part of the image and determine if there’s something important going on in that area. By layering different features together, a more complex understanding of the whole picture can be achieved.
By having many interconnected features, the classification process has access to larger pieces of information and can make better predictions than any single feature alone. This is why creating dense networks — lots of connections between nodes — is such an effective technique for ML applications.
When you are training your network, one of the most important settings is batch size. This hyperparameter is the number of training examples the network processes before each weight update by back-propagation.
Concretely, the model runs a forward pass on every example in the batch, compares its predictions with the known targets to get a loss, and back-propagates that loss to get a gradient averaged over the batch. It then updates its weights once and moves on to the next batch.
The larger the batch size, the more examples contribute to each update. That does not give the model more data; it gives a less noisy estimate of the gradient per step.
However, a very large batch size means fewer weight updates per epoch and more memory per step. Large batches have also often been observed to generalize worse unless the learning rate is retuned, so the model can look very accurate on the training set yet perform poorly on new samples.
Too small a batch size makes each gradient estimate noisy and uses parallel hardware poorly, which limits how quickly training proceeds in wall-clock time.
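To show where batch size enters the training loop, here is a minimal sketch of mini-batch gradient descent fitting a straight line. The function name, the default learning rate and epoch count, and the synthetic data are all assumptions for illustration.

```python
import random


def minibatch_sgd(data, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on squared error.

    Returns (w, b, number_of_weight_updates).
    """
    rng = random.Random(seed)
    data = list(data)
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)  # new batch composition every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                err = (w * x + b) - y
                grad_w += 2 * err * x
                grad_b += 2 * err
            # One update per batch, using the gradient averaged over the batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
            updates += 1
    return w, b, updates


# Noise-free points on the line y = 3x + 1.
data = [(i / 10, 3 * (i / 10) + 1) for i in range(-10, 11)]
print(minibatch_sgd(data, batch_size=4))    # many small updates per epoch
print(minibatch_sgd(data, batch_size=21))   # one full-batch update per epoch
```

With the same number of epochs, the smaller batch size performs six times as many weight updates on this dataset, which is one reason batch size and learning rate are usually tuned together.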
A common beginner’s mistake is decreasing the learning rate too quickly. The model then does not have enough time to learn before the steps become too small, which leads to poor performance.
If progress is slow but the loss is still falling, hold off on reducing the learning rate until the loss plateaus. This way, your model has enough time to make full use of the larger step size!
There are several strategies for choosing when to start decreasing the learning rate. One of the most popular is exponential decay. With this method, you pick a starting learning rate and multiply it by a constant decay factor between 0 and 1 at each training iteration (or each epoch), so that after t steps the learning rate is the starting rate times the decay factor raised to the power t.
In practice the decayed value is usually clamped to a final lower limit so the learning rate never reaches zero. Training continues until the weight updates no longer improve the validation loss. At that point the network has converged for this schedule, which is not a guarantee that it has reached its best possible accuracy.
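An exponential decay schedule with a lower limit can be sketched in a few lines. The function and parameter names (`exponential_decay`, `decay_rate`, `min_lr`) and the numbers are assumptions for illustration, not the API of any particular library.

```python
def exponential_decay(initial_lr, decay_rate, step, min_lr=0.0):
    """Learning rate after `step` decay steps.

    decay_rate is a constant between 0 and 1; min_lr is an optional floor
    so the learning rate never shrinks all the way to zero.
    """
    return max(min_lr, initial_lr * decay_rate ** step)


# Start at 0.1, shrink by 10% per epoch, never go below 0.001.
for epoch in (0, 1, 10, 50, 100):
    print(epoch, exponential_decay(0.1, 0.9, epoch, min_lr=0.001))
```

A decay rate close to 1 gives a gentle schedule; a rate much below 1 shrinks the learning rate quickly, which is the beginner's mistake described above.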
Number of epochs
One of the key choices in developing an effective neural network is how many epochs you train for, where one epoch is one full pass over the training set. Neural networks get better as they are trained for more epochs, but too many can be detrimental to overall performance.
Longer training periods may result in overfitting. A net that has fit all the samples it was exposed to very well may not perform as well on new data!
This happens because the net comes to depend on the specific samples that were used for training, including their noise, so it handles the natural variation in new datasets badly.
A net that performs poorly on unseen samples gives unreliable predictions. You want your model to work across the board, so it cannot depend on sample specifics.
The number of epochs is one factor that controls this. With too few epochs the network underfits: it stops before it has absorbed the patterns in the data, and both training and validation loss are still high.
With too many epochs, the training loss keeps falling while the validation loss stops improving and then starts to rise, which is the signature of overfitting.
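One common way to pick the number of epochs automatically is early stopping: watch the loss on held-out validation data and stop once it has not improved for a few epochs. The sketch below assumes you already have a list of per-epoch validation losses; the function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights you would keep.

    Training is considered stopped once `patience` epochs in a row
    have failed to beat the best validation loss seen so far.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch


# Validation loss falls, bottoms out at epoch 3, then rises (overfitting).
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(val_losses))  # 3
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.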
In general, as mentioned before, when you are training your model there is an inherent risk of overfitting. A very common way to prevent this is using what’s called regularization.
Regularization is a technique that discourages your model from fitting individual training examples too closely. By adding a small penalty to the loss for overly complex solutions, it becomes harder to fit the training data perfectly.
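There are two main types of regularization in deep learning: batch or weight scale regularization and network structure regularization. Regularizers of the first type either penalize large weights (as L2 weight decay does) or inject noise into training (as the batch statistics in batch normalization do), trading a little training accuracy for better generalization. Structural ones, such as dropout, limit how much any one part of the network can learn.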
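As one concrete instance of a weight-scale regularizer, the sketch below adds an L2 penalty to a loss value. L2 is chosen here only as a familiar example; the function names and the strength `lam` are assumptions for illustration.

```python
def l2_penalty(weights, lam):
    """Penalty proportional to the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Total loss the optimizer minimizes: fit to the data plus the penalty."""
    return data_loss + l2_penalty(weights, lam)


small_weights = [0.1, -0.2, 0.1]
large_weights = [3.0, -4.0, 2.5]
# Same fit to the data, but the large-weight solution is charged more.
print(regularized_loss(0.5, small_weights, lam=0.01))
print(regularized_loss(0.5, large_weights, lam=0.01))
```

Because large weights cost more, the optimizer is nudged toward simpler solutions even when a more extreme one would fit the training data slightly better.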
This article will go more in depth about batch normalization but I recommend reading through our beginner level article first! After completing that, come back here for advanced tips. We will also talk about some additional ways to improve the performance of your models.