Neural networks are among the most powerful learning algorithms in computer science. They can be applied to a wide range of problem domains, with applications across many industries. When used for classification (deciding whether something belongs to a specific category), neural networks often outperform other machine learning approaches, especially on images, audio, and text.

Deep learning refers to using neural networks with many layers. This added depth lets the network learn more complex patterns, which often improves accuracy when enough training data is available.

There are many different types of deep neural network (DNN) architectures, which makes it difficult to compare one approach against another. General-purpose software such as Google’s TensorFlow makes developing new DNN architectures easier, but picking the best architecture for your task is beyond our scope here.

This article will focus instead on explaining how you can use some simple tools to help teach a pre-existing DNN algorithm how to classify images. We will also talk about why this technique works, and how you can apply it in the real world.

## What are the different layers in a neural network?

Neural networks contain one or more “layers”. Each layer takes the outputs of the previous layer (or the raw input data, for the first layer), computes a weighted sum of them for each of its neurons, applies a simple nonlinear function called an activation, and passes the resulting values on to the next layer.
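
As a rough illustration of what a single layer computes, here is a minimal sketch in plain Python. The function name `dense_layer`, the ReLU activation, and the example numbers are assumptions made for this example, not something a particular library prescribes.

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer followed by a ReLU activation.

    weights[j] holds the weights of output neuron j, one per input value.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of all inputs plus the neuron's bias.
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        # ReLU activation: keep positive values, clamp negatives to zero.
        outputs.append(max(0.0, z))
    return outputs


# Hypothetical example: 3 input values feeding a layer of 2 neurons.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.25], [1.0, 1.0, -1.0]]
b = [0.0, 0.5]
hidden = dense_layer(x, W, b)
print(hidden)  # [0.0, 0.5]
```

Stacking several such calls, each one fed with the previous call’s output, gives a multi-layer network.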

The number of layers you need depends mostly on how complex the problem is, how much data you have, and how much accuracy you want to achieve. Having too many layers can result in overfitting, which means the network becomes very good at fitting your training dataset but does not generalize well to new examples.

Too few layers cause the opposite problem, underfitting: the network cannot learn enough about your data and misses important features of the image or sound, so it performs poorly even on the training set.

## What is backpropagation?

Backpropagation is one of the most important concepts in deep neural network (DNN) theory. It is the algorithm that computes the gradients used in gradient-based learning. It was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in a 1986 paper.

In backprop, we take the error at the output and propagate it backward through each layer, working out how much each weight contributed to it. Each weight is then nudged in the direction that reduces the error, in proportion to its contribution. This process is repeated over many training examples, changing the weights until the network’s error goes down.
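
To make the idea concrete, here is a minimal sketch of backpropagation on the smallest possible network: one hidden neuron and one output neuron. The function name `train_step`, the tanh activation, the squared-error loss, the learning rate, and the single training example are all assumptions chosen to keep the example short.

```python
import math


def train_step(w1, w2, x, target, lr):
    # Forward pass: input -> hidden neuron (tanh) -> linear output neuron.
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2

    # Backward pass: apply the chain rule from the output back to the input.
    dloss_dy = y - target
    dloss_dw2 = dloss_dy * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x  # d tanh(z)/dz = 1 - tanh(z)^2

    # Gradient descent: move each weight against its gradient.
    return w1 - lr * dloss_dw1, w2 - lr * dloss_dw2, loss


# Hypothetical starting weights and a single training example.
w1, w2 = 0.5, 0.5
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=0.5, lr=0.1)
    losses.append(loss)

print(losses[0], losses[-1])  # the last loss is much smaller than the first
```

Real networks do the same thing with matrices of weights instead of two scalars, but the forward pass, backward pass, and update follow the same pattern.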

The key idea in backprop is that, instead of tuning individual parameters independently by trial and error, you compute the gradient of the error with respect to every weight in a single backward pass and update all of them together. The structure of the model itself stays fixed; only the weights change.

This allows DNNs to get better and better as they learn increasingly complex patterns from data. Because there are so many connections within the networks, they can find more nuanced relationships between different features than simpler models can.

## What is cross-entropy?

Cross-entropy is a loss function: a metric for how close your model’s predicted probabilities are to the actual results. It comes in two variants: binary and multiclass. The binary variant measures how much probability the model assigned to the correct outcome when the result is either one or zero. The multiclass version does the same, but instead of a single yes-or-no outcome, it looks at the probability assigned to the correct class out of several (there can be more than two!).
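
The two variants can be sketched in a few lines of Python. The function names and the small `eps` clamp (used to avoid taking the logarithm of zero) are assumptions for this example; real libraries offer their own, more general versions.

```python
import math

EPS = 1e-12  # assumed small constant that keeps log() away from zero


def binary_cross_entropy(y_true, p):
    """y_true is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def categorical_cross_entropy(true_class, probs):
    """true_class is the index of the correct class; probs sums to 1."""
    return -math.log(max(probs[true_class], EPS))


# A confident correct prediction gives a lower loss than a hesitant one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.6))

# Three classes, where the correct class (index 2) received probability 0.7.
print(categorical_cross_entropy(2, [0.1, 0.2, 0.7]))
```

In both functions the loss depends only on the probability given to the correct answer, which is why a confident wrong prediction is punished so heavily.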

The difference between these versions of the loss function is the number of classes they handle. In both cases, confident wrong predictions are penalized heavily, and the loss becomes lower as the probability assigned to the right answer gets closer to 1, even when the predicted class was already correct.

That said, the harder it is to identify which category something falls into, the lower the accuracy of the algorithm will be. This isn’t necessarily bad, especially if there’s no way to know exactly where something lies on the spectrum of categories. For example, if someone’s handwriting is ambiguous, it may be difficult even for a person to tell which word they meant to write.

People sometimes use the mean absolute error (MAE) as their loss function instead, but it has some drawbacks for classification, where cross-entropy is usually the better fit.

## What is the K-nearest neighbor algorithm?

The k-nearest neighbor (KNN) algorithm classifies a new sample by looking at the labeled samples most similar to it. The class label it assigns is the most common category among those k nearest neighbors.

By comparing each new sample with all of the known samples, you get a distance to every one of them; the labels of the k closest samples determine which class the new sample most closely resembles.
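
A minimal sketch of that comparison, assuming Euclidean distance, a majority vote among the k closest samples, and a made-up toy dataset (the names `knn_predict`, `points`, and `labels` are hypothetical):

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    # Distance from the query to every known sample.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest samples and let them vote on the label.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy dataset: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (6, 5)))  # b
```

Choosing an odd k for a two-class problem, as here, avoids tied votes.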

KNN has no separate training step, since it simply stores the known samples. Its predictions generally improve as more representative samples are added, although it can give reasonable results on small datasets when the classes are well separated.

There are two main choices to make when using KNN: the distance measure (for example, Euclidean or Manhattan distance) and the number of neighbors k.

## What is linear regression?

Linear regression is an important algorithm in machine learning that fits a straight-line (linear) equation to data in order to predict outcomes. It is typically used for predicting continuous, quantitative variables such as prices or temperatures.
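
For a single input variable, the best-fitting line can be computed directly with the ordinary least squares formulas. The sketch below assumes that setting; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Predict the outcome for a new input.
print(slope * 5.0 + intercept)  # 11.0
```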

Linear regression was one of the first widely applicable algorithms in statistics and machine learning. In fact, it is so general and useful that some refer to it as “the workhorse” of the field, and it is a building block of many other algorithms like logistic regression, neural networks, etc.

What makes this algorithm special is how easily you can apply it. You need very little background in math or statistics to get started with it.

Generalized linear models are an extension of linear regression for outcomes that aren’t continuous, normally distributed numbers, such as yes/no results or counts. They connect the linear equation to the outcome through a link function.

This article will go over several different ways to learn how to implement generalized linear regression in Python.

## What is logistic regression?

Logistic regression is a classification algorithm that was introduced by the statistician David Cox in 1958 as a way to model binary outcomes (for example, whether something is human or not). Since then, it has become one of the most commonly used algorithms in statistics and machine learning.

In fact, many companies now use it to solve difficult business problems! This includes solving marketing challenges (is this product relevant to people with this condition?), understanding customer behavior patterns (why did someone buy this instead of that product?) and finding new products and features to sell (by using computational methods to determine if there’s potential profit here)!

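So what exactly does logistic regression do? It takes input data and creates a probability estimate for each possible outcome. To do this, it computes a weighted sum of the input features and passes it through the sigmoid (logistic) function, which squashes the result into a probability between 0 and 1. The final result is the class with the highest probability; for two classes, this usually means predicting the positive class when its probability is at least 0.5.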
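
Here is a minimal sketch of how an already trained logistic regression model turns inputs into a prediction: a weighted sum of the features is passed through the sigmoid function to get a probability, which is then compared with a threshold. The weights, bias, and threshold below are hypothetical values chosen for illustration, not the result of real training.

```python
import math


def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    # Weighted sum of the features, converted to a probability of class 1.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    # Pick class 1 when its probability reaches the threshold.
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical trained parameters for a model with two features.
weights = [1.5, -0.5]
bias = -1.0

print(predict_proba([2.0, 1.0], weights, bias))  # probability above 0.5
print(predict([2.0, 1.0], weights, bias))        # 1
print(predict([0.0, 2.0], weights, bias))        # 0
```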

A few examples

Let’s look at some practical applications of logistic regression. For reference, we will be applying logistic regression to two different datasets: natural vs printed word documents and dog breeds.

Natural vs printed document classification

This site collects information about different types of documents (for example, academic papers, novels, etc) and determines which category they belong to (like “fiction novel” or “scientific article”).

## What is a linear SVM?

Linear SVMs (support vector machines) are one of the most fundamental classification algorithms used in machine learning. Technically, they are discriminative models: they learn the boundary between classes directly rather than modeling how each class generates its data. A linear SVM takes as input a set of values (called features or dimensions) for an object and calculates whether that object belongs to a particular category or not.

The way it does this is by creating a hyperplane: a flat boundary in multidimensional space that separates the two categories. During training, the hyperplane is shifted and rotated until the margin, the distance between it and the closest instances of each group, is as large as possible.
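
Once a hyperplane has been found, using it is simple: a sample is classified by which side of the hyperplane it falls on. The sketch below assumes a hyperplane described by a weight vector and a bias, with labels of +1 and -1, and also shows the hinge loss commonly used to train linear SVMs. The parameter values are hypothetical, and no training is performed here.

```python
def decision_value(features, weights, bias):
    # Signed score: positive on one side of the hyperplane, negative on the other.
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(features, weights, bias):
    return 1 if decision_value(features, weights, bias) >= 0 else -1


def hinge_loss(features, label, weights, bias):
    # Zero when the sample is on the correct side with a score of at least 1,
    # and growing linearly the further it is on the wrong side.
    return max(0.0, 1.0 - label * decision_value(features, weights, bias))


# Hypothetical hyperplane: the line x1 - x2 = 0 in two dimensions.
weights = [1.0, -1.0]
bias = 0.0

print(classify([3.0, 1.0], weights, bias))       # 1
print(classify([1.0, 3.0], weights, bias))       # -1
print(hinge_loss([3.0, 1.0], 1, weights, bias))  # 0.0
print(hinge_loss([1.5, 1.0], 1, weights, bias))  # 0.5
```

The second hinge loss is nonzero even though the sample is classified correctly, because it sits too close to the boundary; this is how training pushes the hyperplane away from nearby samples.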

By using different features, you get different hyperplanes, so the choice of features strongly affects how well the algorithm works. There are many strategies for picking your feature set, but sometimes people just try a few combinations and see what works!

Linear SVMs were originally proposed in the 1960s by Vladimir Vapnik, a Russian mathematician now known for his work in pattern recognition and statistical inference, together with Alexey Chervonenkis. Since then, they have been adapted, improved, and generalized in countless ways. You probably use something like a linear SVM every day without realizing it.

With enough examples, a linear SVM will learn how to classify things on its own. This process is called training and the resulting model is usually very good at distinguishing new examples according to their labels.

Some datasets cannot be separated by a flat hyperplane though, so researchers developed non-linear extensions of the algorithm.

## What is a nonlinear SVM?

A non-linear SVM is an algorithm that uses some special functions to classify objects. These functions are called kernels. A kernel measures how similar two samples (two sets of feature values) are, and the SVM uses this information to determine if something belongs in one class or another.

The kernel can be thought of as giving a degree of similarity between the samples being compared. The larger the kernel value, the more similar the samples are. Small kernel values indicate samples that are very different from each other.

By having these similarities, the algorithm can draw curved boundaries between groups instead of being limited to a flat hyperplane. It decides where an element belongs by looking at how similar it is to key training samples from both groups, rather than just checking which side of a straight line it falls on.

There are many types of kernels you can use in a non-linear SVM. Some examples include polynomial, sigmoid, and Gaussian kernels. Each has its own strengths and weaknesses depending on what you want to learn.
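
The three kernels named above can be sketched as plain functions of two feature vectors. The formulas are the standard textbook forms; the function names and the default values for `degree`, `gamma`, and `coef0` are assumptions for this example and would normally be tuned.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    return (dot(a, b) + coef0) ** degree


def sigmoid_kernel(a, b, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(a, b) + coef0)


def gaussian_kernel(a, b, gamma=0.5):
    # Equals 1 for identical samples and decays toward 0 as they move apart.
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


a = [1.0, 2.0]
near = [1.0, 2.5]
far = [5.0, 6.0]

print(gaussian_kernel(a, a))     # 1.0
print(gaussian_kernel(a, near))  # close to 1
print(gaussian_kernel(a, far))   # close to 0
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))  # 144.0
```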