A common problem that beginners face is how to handle an unbalanced dataset. An unbalanced (or imbalanced) dataset is one in which some classes have far fewer samples than others, which makes it hard for deep learning algorithms to learn the rare classes well.
A very popular model in the field of computer vision is known as VGGNet. VGGNets are convolutional neural networks, which work by passing the input through a stack of layers that build up increasingly complex patterns.
The reason they are so famous is that they can recognize all sorts of things, such as animals, fruits and vegetables, cars, and even people!
However, even a strong architecture like this struggles when one class is badly underrepresented in the training data.
Suppose you wanted a model to flag paid content creators (think YouTube video makers, writers, etc.) among general online data: the class you care about is a tiny fraction of the whole, so the model sees far too few positive examples to learn from. Thankfully, we have some solutions here! Read on for them.
Look more closely at the data
There are two important things you can do when an algorithm is performing poorly on one part of your dataset.
The first is to look closely at which samples are being classified correctly and which are not. By doing this, you may find that there is something about the samples in the minority class that makes them different from the others.
By looking into the differences between the samples, we can figure out what makes the minority class unique, and then encode those distinguishing characteristics as new inputs for our own model.
This is called feature engineering, and it’s an essential step in most deep learning projects.
Feature engineering happens while you prepare your training data, before the network is trained with your chosen optimizer (you will learn more about optimizers later). You extract features from your training set using mathematical functions and domain logic.
Some common features include mean values, proportions, value ranges, and so on. You get the idea: you create and apply your own rules to capture what makes a given example special.
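To make this concrete, here is a minimal sketch of hand-crafted features on a small tabular dataset. The column names (views, likes, comments) are purely hypothetical and just illustrate the kind of means, proportions, and ranges described above.

```python
import pandas as pd

# Hypothetical tabular dataset; the column names are made up for illustration.
df = pd.DataFrame({
    "views":    [120, 4500, 80, 9800, 300],
    "likes":    [10, 900, 4, 2100, 25],
    "comments": [2, 150, 1, 400, 3],
})

# Hand-crafted features: proportions, combined rates, and a standardized value,
# the kind of rules that might separate the minority class from the rest.
df["like_rate"]    = df["likes"] / df["views"]                      # proportion
df["engagement"]   = (df["likes"] + df["comments"]) / df["views"]   # combined rate
df["views_zscore"] = (df["views"] - df["views"].mean()) / df["views"].std()

print(df.head())
```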
Ensure the data set is representative of the population
A common beginner mistake when working with datasets is using an unrepresentative sample of the population. This can happen when there are biases in how the data was collected, or when whole parts of the population are simply missing.
For example, if you are trying to predict whether someone will commit a crime, it might seem sensible to build a dataset only from people who have committed crimes. While this may look like a good idea, it goes against what most experts tell us about human behavior: we are all different, and there are many reasons why one person might choose to break the law while another does not.
By excluding individuals without criminal histories from your training set, you leave the model with no negative examples to learn from, which limits its accuracy. You also risk falsely identifying innocent individuals as likely criminals, because the algorithm has never seen what an innocent case looks like.
When handling imbalanced datasets, don't simply add more samples wherever there aren't enough representatives; instead, explore methods such as reweighting, undersampling, and oversampling.
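As a rough sketch of the reweighting idea, scikit-learn can compute class weights that make the rare class count more heavily during training; the label array below is a toy example, not real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights each class inversely to its frequency,
# so the minority class carries more weight during training.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))
print(class_weight)  # roughly {0: 0.56, 1: 5.0}

# Many estimators accept this directly, for example
# LogisticRegression(class_weight=class_weight) or class_weight="balanced".
```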
Use a variety of techniques to balance the data
When working with an unbalanced dataset, there are several different strategies you can use to improve the performance of your model.
One is to simply increase the number of examples in the minority class until it matches the others. For example, if one category has only two examples while all the others have many, you can create new instances of that class by duplicating or slightly varying the ones you already have.
This is not ideal, because the model may overfit to those repeated examples, but it is often better than leaving the class with too little data to learn from at all.
Another option is called under-sampling. This means randomly removing some instances from the majority class so that the class sizes are closer together.
Be careful, though: some algorithms need lots of samples, and if you throw away too much majority-class data they will keep trying to train without success until they are given enough.
A third way to fix this problem is oversampling, which is the opposite of under-sampling. This means creating extra copies of minority-class examples to add weight, and ideally some diversity, to the training set.
For image data, the easiest way to do this is to duplicate the minority-class pictures with small augmentations (flips, crops, slight rotations) so the copies are not exactly identical.
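Here is a minimal sketch of random over- and under-sampling using scikit-learn's resample utility; the arrays are synthetic and the class sizes are just for illustration.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 4)                      # 100 samples, 4 features
y = np.array([0] * 90 + [1] * 10)          # 90 majority, 10 minority samples

X_maj, X_min = X[y == 0], X[y == 1]

# Oversampling: draw minority samples with replacement until the classes match.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Undersampling: keep only a random subset of the majority class.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

X_over  = np.vstack([X_maj, X_min_up])     # 90 + 90 samples
X_under = np.vstack([X_maj_down, X_min])   # 10 + 10 samples
print(X_over.shape, X_under.shape)
```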
Try different algorithms with different parameters
When working with neural networks, one of the most important things is finding an appropriate balance in how much data, and which parts of it, you use to train your network.
If you have a very large dataset that contains lots of features, then it becomes difficult to determine which feature will contribute more towards predicting the outcome.
This is because there are just so many!
In such cases, some people drop some of the features, or train on only part of the data. These are forms of feature selection and sub-sampling, and both have to be done with care.
Using too little data can result in overfitting and poor accuracy on new examples, while piling on more data only helps if it is actually relevant to the task. A good example of this is when companies use the text of big marketing slogans to predict whether someone has purchased a product before.
That style of advertisement may work well for their current products, but would probably not generalize to others. Even so, the model takes the first few sentences of the slogan and decides whether the person will purchase based on those.
How does deciding whether or not to buy a product depend on the first few words? Overfitting.
By incorporating too much superficial information into the model, the system learns internal representations of the word sequences themselves instead of the underlying concept. In other words, it learns how to identify a marketing slogan rather than what makes a good product.
Having too many features relative to the number of labelled outcomes can also hurt accuracy, because each individual pattern ends up supported by only a handful of samples.
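To illustrate trying different algorithms and parameters, here is a small sketch that compares a few scikit-learn models on the same synthetic imbalanced data; the particular models and settings are examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic imbalanced data: roughly 10% positive class.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "logreg (plain)":    LogisticRegression(max_iter=1000),
    "logreg (balanced)": LogisticRegression(class_weight="balanced", max_iter=1000),
    "forest (balanced)": RandomForestClassifier(n_estimators=200,
                                                class_weight="balanced",
                                                random_state=0),
}

for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    score = f1_score(y_te, model.predict(X_te))  # F1 is more telling than accuracy here
    print(f"{name}: F1 = {score:.3f}")
```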
Understand how to choose your hyperparameters
When doing any kind of machine learning, there are three main activities that take up most of the time: training, tuning, and testing. Depending on what you are trying to achieve, you will need to focus on only one or two of these.
In deep neural networks, a common component is the number of layers and size of each layer. There are also different types of layers such as convolutional and pooling layers, which affect the shape and scope of the network.
Another important part of training is choosing your loss function and optimizing it. What matters here is not just whether your model works but if it works better than before!
The last crucial step is selecting your hyperparameter settings. These include things like batch size, momentum, weight decay, and the initializers for the weights. Bad choices can result in a poor performance score, or prevent the algorithm from converging at all.
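As a hedged example of what those settings look like in code, here is a PyTorch-style sketch; the architecture and the specific values (batch size, learning rate, momentum, weight decay) are illustrative, not recommendations.

```python
import torch
import torch.nn as nn

# A small example network; the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Initializers for the weights of the linear layers.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

# Hyperparameters mentioned above: batch size, momentum, weight decay.
batch_size = 64
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # learning rate
                            momentum=0.9,       # momentum
                            weight_decay=1e-4)  # L2 weight decay

loss_fn = nn.CrossEntropyLoss()  # the loss function being optimized
```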
Use cross-validation
When it comes to optimizing your neural networks, one of the most important things is how you test and validate them! There are several ways to do this, but one of the most common methods is called “cross validation”.
Cross validation works by dividing your dataset into several parts, or folds. In each round you train your network on all but one fold and evaluate its performance on the fold that was held out, and the process is repeated until every fold has served as the test set once.
The final accuracy value is an average of all these separate tests. If the folds produce similar accuracies, the model's performance is stable and not overly dependent on which samples it happened to see; averaging multiple rounds gives you a more reliable number than a single train/test split.
By doing this, you can determine the best hyperparameters (the settings of your neural net) for your model. For example, you could find the optimal batch size or learning rate depending on whether the model overfits the data or does not.
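A minimal sketch of cross-validation with scikit-learn is shown below; stratified folds keep the class proportions the same in every split, which matters for imbalanced data, and the model and scoring choice are just examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(class_weight="balanced", max_iter=1000),
                         X, y, cv=cv, scoring="f1")

# The average and spread across folds give a more reliable picture than one split.
print(scores.mean(), scores.std())
```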
Make sure you have enough data
Having an adequate amount of data is essential to training your model properly. If your dataset is not large enough, then your model will not be able to find enough examples of each class to learn from.
If your dataset is very small, your model won't see enough cases to work out the most likely outcome, which it needs to do to predict the right result. This could also mean that your model will make wrong predictions more often than correct ones.
When determining how much data your model needs to effectively work, there are two main factors: variance and reliability.
Variance refers to how much different instances of a pattern or concept differ from one another. Take predicting whether a car will break down soon: cars vary widely in age, mileage, and maintenance history, so the concept has high variance and you need many examples to cover it.
A low degree of variability means a smaller dataset may be enough, because most examples look alike. However, this may not apply to every type of vehicle, so we cannot say with certainty how much data any particular problem will need.
Reliability refers to how consistently a pattern repeats itself. For example, if you were trying to judge whether a boat was safe to use, safety ratings measured the same way for every boat are a reliable source of information, whereas one-off anecdotes are not.
Try different initializations
When using neural networks, one of the most important steps is choosing the initial values for the network's weights (also called parameters), which determine where learning starts from.
One of the most common weight initialization methods is Xavier (also called Glorot) uniform. It draws each weight uniformly at random from a small symmetric range whose width is scaled by the number of inputs and outputs of the layer, so that the signal keeps roughly the same variance as it passes from layer to layer.
However, this may not be the best option for every problem domain. For example, if you want to recognize cats, a poor choice of initialization can let the signal shrink or blow up as it passes through the layers, and the model ends up learning something trivial instead.
That would not be much use, because a model stuck like that can end up identifying everything as a cat, or nothing as one!
Instead, try another popular initialization technique called He initialization. It works similarly to Xavier, except each weight is drawn from a distribution whose scale is sqrt(2/n), where n is the number of input units of the corresponding layer; this suits ReLU activations better.
Note: a poorly initialized network, for example one where you forgot to initialize some layers, can take much longer to converge the first time you train it. Do not worry about this; just re-initialize the layers and start training again.
You can also try other initializations such as Glorot normal (the Gaussian counterpart of the Xavier uniform scheme above), a plain Gaussian, or a uniform distribution within [-r, r].
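For reference, here is a short sketch of applying these initializations to a layer's weights in PyTorch; the layer sizes and ranges are arbitrary, and each call simply overwrites the previous one.

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier (Glorot) uniform: symmetric range scaled by the layer's fan-in and fan-out.
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming) initialization: variance scaled by 2/fan_in, suited to ReLU layers.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Plain alternatives: a Gaussian, or a uniform distribution within [-r, r].
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)

nn.init.zeros_(layer.bias)
```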