Recent developments in deep learning have ushered in an era of unprecedented opportunities for computer vision, *natural language processing* (NLP), and other applications. While some skeptics question whether this is truly the case, most agree that we’re now in the very early stages of AI, with potentially far-reaching consequences.

In fact, many believe that computers can already match or surpass human performance at certain narrow tasks, a milestone some see as a step toward “general intelligence.” This has huge implications not only for productivity and efficiency but also for how society functions.

If you’re new to ML, this article will give you a good starting point by walking through several common normalization techniques used when *training neural networks*. Along the way it touches on some *key terms like feature*, layer, and neuron.

## Standard normalization

When working with neural networks, one of the most important preprocessing steps is standardizing or scaling your data. This step keeps features with large numeric ranges from having an outsized influence on the outcome.

Data sets that are not normalized can contribute to overfitting, as the *network learns spurious patterns* in the training set that stem from feature scale rather than genuine signal.

When the model is applied to new data, these scale-driven biases carry over even if the feature distributions have shifted. In other words, the model may treat every instance of a feature as if it means the same thing!

Standard normalization methods include mean subtraction, variance equalization, and the z-score transformation. All of these build on two quantities, so let’s take a closer look at them. After mean subtraction, the average value of all instances of a feature should be zero, and the std (standard deviation) of each feature must be known.

Once those two values are calculated, the feature can be **scaled using either** its own std or the std of the entire dataset. Dividing each feature by its own std gives every feature unit variance, while dividing everything by a single dataset-wide std preserves the relative spread between features.
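As a concrete illustration, here is a minimal NumPy sketch of mean subtraction followed by per-feature std scaling (the array values are invented for the example):

```python
import numpy as np

# Hypothetical toy dataset: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation

X_std = (X - mu) / sigma  # each feature now has mean 0 and std 1
```

Using `X.std()` (a single dataset-wide scalar) instead of `axis=0` would preserve the features’ relative spreads rather than equalizing them.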

## Feature normalization

In addition to feature scaling, another way to make your features more comparable is called feature normalization. This works by taking each feature and dividing it by a reference quantity, such as its mean, its range, or its standard deviation.

The most common type of normalized feature is what’s known as the z-score. Each feature value has the feature’s mean subtracted from it and is then divided by the feature’s standard deviation to get its new z-score.

For example, if our feature was the length of an object, we could divide each value by the average length of all objects. The normalized values then cluster around 1: any length above the average would score higher than 1, and any length below the average would score lower.

This can also be done for numeric features, like the price of an item! However, when doing this with prices there is something important to note. Because very expensive items tend to have **much larger numerical values** than *less expensive ones*, you will often want to use a logarithmic (log) scale instead of a regular linear one.
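A minimal sketch of log scaling for a price-like feature (the price values are invented for the example):

```python
import numpy as np

prices = np.array([5.0, 20.0, 99.0, 1500.0, 80000.0])

# log1p computes log(1 + x), which stays finite even for zero prices.
log_prices = np.log1p(prices)

# On the raw scale the largest price is 16,000x the smallest;
# on the log scale the gap shrinks to roughly 6x.
```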

Using normalized features helps avoid overfitting because it keeps your model from leaning on raw magnitudes alone. You *may also notice better performance due* to improved generalizability.

## Layer normalization

Layer normalization is an interesting technique that has become very popular in recent years. Here, “normalizing” refers to rescaling the activations of a layer, using their own mean and variance, before they are passed on as input to the next layer.

The best-known formulation comes from work co-authored by Geoffrey E. Hinton, one of the pioneers of deep learning. In that formulation, each layer’s outputs are shifted and rescaled so that, for every training example, they have zero mean and unit variance, while learnable gain and bias parameters preserve the layer’s expressive power.

This rescaling acts as a kind of regulation or control on the activations flowing through the network. It stabilizes gradients during training, which allows more complex functions to be learned because the optimizer has more freedom to **explore different solutions**.

In layman’s terms, this means that by using layer norms, your model will tend to *train faster and more stably*.
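A minimal NumPy sketch of the core layer normalization computation (the learnable gain and bias are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample (row) across its own features.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

activations = np.array([[1.0, 2.0, 3.0, 4.0],
                        [10.0, 20.0, 30.0, 40.0]])
normed = layer_norm(activations)
# Each row now has (approximately) zero mean and unit variance,
# regardless of the original scale of that row.
```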

## Group normalization

A very popular layer that has **seen many variations** is group normalization. This layer changes *which* activations are grouped together when the normalization statistics are computed!

Instead of normalizing across all channels of a layer at once, the channels are divided into groups, and a mean and variance are computed within each group for each sample. Each activation is then normalized using its own group’s statistics:

$$\hat{x}_i = \frac{x_i - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}}$$

where $\mu_g$ and $\sigma_g^2$ are the mean and variance of the group $g$ that activation $x_i$ belongs to, and $\epsilon$ is a small constant for numerical stability.

Because these statistics are computed per sample rather than across the batch, group normalization behaves the same regardless of batch size.
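A minimal NumPy sketch of group normalization for image-shaped activations (the shapes and group count are chosen arbitrarily for the example):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x has shape (N, C, H, W); stats are computed per sample, per channel group.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mu) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))   # 8 channels split into 4 groups of 2
y = group_norm(x, num_groups=4)
```

With `num_groups=1` this reduces to layer normalization over the whole feature map; with `num_groups=c` each channel is normalized on its own.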

## Unit normalization

A second way to preprocess data is unit scaling, or unit normalization. This method rescales each feature or sample to a common unit, which shifts the **focus onto comparing relative differences instead** of absolute numbers.

For example, say you wanted to predict whether an individual will spend more than $1,000 on fashion products online. You could use price as a predictor, but it would be difficult to compare one person’s budget with another person’s budget because they **may buy different quantities** of items.

By using normalized prices (for example, spend per dollar of total budget), it becomes possible to make meaningful comparisons between individuals.
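A minimal sketch of this per-budget normalization, with invented spending figures:

```python
import numpy as np

# Hypothetical dollars spent per category (fashion, food, travel) for two shoppers.
big_spender = np.array([300.0, 100.0, 600.0])
small_spender = np.array([30.0, 10.0, 60.0])

# Divide each vector by its total so entries become fractions of the budget.
big_profile = big_spender / big_spender.sum()
small_profile = small_spender / small_spender.sum()
# The two profiles are identical even though the budgets differ by 10x,
# so the shoppers can now be compared on spending *pattern* alone.
```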

Another reason this technique is important is that some features are not linearly related to your outcome. For instance, hair length is not necessarily linked to whether people perceive you as intelligent.

However, when you apply linear regression to raw, unscaled features, the size of each coefficient is tangled up with the units of its feature, so the model can over- or under-weight a feature purely because of its scale. By normalizing, you put every variable on a comparable footing before the model weighs it against your dependent variable.

Unit scaling and unit normalization both play an integral part in ensuring that your model works properly. When applied correctly, *deep learning models learn* more information from the data.

## Min-max normalization

A popular way to normalize data is called min–max normalization. This method removes the bias that exists when your features are not scaled properly.

When features occupy very different numeric ranges, some can have an outsized effect on the model while others barely register. To fix this, you need to rescale the features so that *each one gets* a fair chance to influence the outcome.

Min–max normalization does just that by subtracting the minimum value of each feature and then dividing by the difference between its maximum and minimum values. The result is a normalized feature with a range between 0 and 1.
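A minimal NumPy sketch of min-max normalization, rescaling a feature into [0, 1]:

```python
import numpy as np

def min_max(x):
    # Subtract the feature's minimum, then divide by its range.
    return (x - x.min()) / (x.max() - x.min())

feature = np.array([-1.0, 0.0, 0.5, 1.0])
normed = min_max(feature)  # [0.0, 0.5, 0.75, 1.0]
```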

Example: Suppose Feature A has a range of [-1, 1]. A new instance with a raw value of 0 is normalized to (0 - (-1)) / (1 - (-1)) = 0.5, placing it exactly in the middle of the new [0, 1] range.

Threshold B already lives on a [0, 1] scale, so the same transformation leaves it unchanged. After normalization, Feature A and Threshold B share the same [0, 1] scale and can be compared directly.

## Z-score normalization

In addition to scaling values by their mean or standard deviation, another way to normalize data is using z-scores. This method shifts and **scales numeric values** so that they have a mean of 0 and a standard deviation of 1.

Values far above the mean map to positive z-scores and values far below it map to negative z-scores, while the overall spread is compressed or stretched to unit variance. The formula for z-scoring, which determines how much to shift and scale each value, is:

$$ z = \frac{x - \mu}{\sigma} $$

where x is the **current numerical value**, μ is the average of all numerical values, and σ is the standard deviation of all numerical values.
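A minimal NumPy sketch of the z-score formula applied to an invented feature column:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = x.mean()     # 30.0
sigma = x.std()   # ~14.14

z = (x - mu) / sigma
# z is now centered on 0 with unit standard deviation;
# the middle value (30.0) maps exactly to 0.
```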

## Probability normalization

A very common technique used in neural networks is probability normalization. This is typically applied at the output layer of the network, where raw scores are converted into a probability distribution over classes.

The way it works is by passing the output of the final layer through a normalizing activation function. These **include things like sigmoid**, which squashes a single value into the range (0, 1), and softmax, which exponentiates each output and divides it by the sum of all the exponentiated outputs.

Because exponentiation makes every term positive, even a negative raw score such as -1 maps to a small positive probability, while a large raw score such as +3 maps to a probability close to 1. The resulting values always sum to exactly 1.

This process happens for several reasons, but **one major reason** is so that the overall scale of the numbers being computed becomes more meaningful. Consider a model that is trying to predict whether a picture contains an elephant.

If we just use the raw output scores directly, there isn’t *really anything special* about a value of 3: it has no fixed meaning relative to other possible scores, making the **results less intuitive**. After probability normalization, that same score becomes a probability we can interpret at a glance.
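A minimal NumPy sketch of softmax as a probability normalizer, using an invented score of 3 for the “elephant” class:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result.
    e = np.exp(logits - logits.max())
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.2])  # raw, hard-to-interpret outputs
probs = softmax(scores)
# probs sums to 1, so the score of 3 now reads as a probability.
```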