Over the past few years, artificial intelligence has been in the news a lot! Between stories of self-driving cars taking over the roads and reports about companies developing chatbots to interact with customers online, there’s always something new happening.
If you’re looking to learn more about AI, how it works, and some of the applications that use it, then this article is for you. We will walk through an algorithm called deep Q-learning and what makes it so special.
What is deep Q-learning?
Deep Q-learning, and the deep Q-network (DQN) built on it, was introduced by Volodymyr Mnih and colleagues at DeepMind in their 2013 paper "Playing Atari with Deep Reinforcement Learning," with a follow-up published in Nature in 2015. A DQN is a neural network trained with reinforcement learning to estimate the value of each action available in a given state.
The authors showed that a single network architecture could reach state-of-the-art performance across dozens of Atari games, and DQN has since become one of the most influential algorithms in reinforcement learning.
A key strength of DQN is its ability to generalize. Because the network learns features of states rather than memorizing individual states one by one, it can act sensibly even in situations it never saw during training. This makes it particularly useful in domains where training instances are scarce or expensive to gather.
Definition of reinforcement learning
Reinforcement learning (RL) is an artificial intelligence (AI) technique that allows computers to learn how to perform tasks by observing and interacting with their environment.
In RL, you don’t give your software step-by-step instructions. Instead, the program, called an agent, tries actions, observes their outcomes, and receives a numerical reward signal telling it how well it is doing. Over time it figures out which actions lead to the most reward in each situation.
The more experience the agent gathers doing something, the more it knows about that thing! The underlying idea goes back to mid-20th-century psychology, where B.F. Skinner’s work on operant conditioning showed that behavior can be shaped by its consequences.
The same principle shows up in economics, where companies incentivize customers to spend money through rewards such as gift cards or discounts. The term “reward” in RL carries exactly this meaning.
Here, though, the reward isn’t just any old perk: it’s how you tell the computer what you want it to do. That’s why it’s called reinforcement — behavior that earns reward is reinforced, so the agent learns to repeat it.
Representation of states
A state in reinforcement learning is a snapshot of the situation the agent currently faces. For example, if your goal is to predict whether a movie is good or bad, a state could be one half-hour clip of the movie that you use to judge its quality.
The exact representation of this movie segment matters less than its consistency: well-made and poorly made movies should be encoded the same way, so that any differences the network sees reflect quality rather than formatting.
From these representations, neural networks learn which characteristics of the input matter for the task at hand. This is called feature extraction.
Eventually the network finds patterns in these features that let it make predictions about movies that go well beyond the raw clip itself.
Representations of the next state depend on what the previous state was. If the last state was to predict whether a movie is romantic or action oriented, then the representation of the current state could include information about the genre of the movie and how long the main character talks.
This way, the network learns to connect ideas such as “romantic” and short conversations and concepts such as “action oriented” and longer talk times. You can also add additional layers to understand deeper themes and concepts within each category.
Representation of actions
One of the key components of deep learning is how representations of data are built. Neural networks use layers to create these representations, with each successive layer producing more abstract, higher-level features.
In classical machine learning, researchers would take an example set of data and develop rules or functions that predict what kind of thing something is by looking at its characteristics. For instance, you could look at all pictures of dogs and see whether they have tails or not to determine if a picture is really of a dog or not.
This method works well when there are clear definitions of the things being categorized, but it breaks down when there aren’t. A cat has a tail too, so a rule based on tails will happily classify pictures of cats as dogs!
Neural networks avoid this problem by exploring patterns in the relationships between different parts of the examples. If two concepts share a part of their definition then that shared component becomes an additional feature used to identify both concepts. This way, the neural network doesn’t need to define exactly what “dogness” is, only that some aspect of “dogs” shares common traits with the rest of the concept “things with fur and bark.”
However, one challenge any learning system faces is having good-quality training samples. It can learn useful information from the right sort of sample, but may fail if the samples it was given weren’t representative enough.
Representation of rewards
A key element in deep reinforcement learning is how it represents reward. There are two main types of representations used for this.
Expected value (EV) representation – also known as a state-value function, this assigns each state a number equal to the expected return from that state when following a policy π. For a finite Markov decision process it can be written as:

V(s) = Σ_a π(a|s) Σ_s' P(s'|s, a) [ R(s, a, s') + γ V(s') ]

where π(a|s) is the probability of choosing action a in state s, P(s'|s, a) is the probability of landing in state s' after taking that action, R(s, a, s') is the immediate reward, and γ (between 0 and 1) is the discount factor that weighs future rewards against immediate ones. The two sums take an expectation over every action the policy might choose and every state that can follow.
The values in this equation depend on both the current state and the actions the policy chooses.
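As a concrete illustration, here is a minimal sketch of that expected-value calculation on a toy two-state problem. The states, actions, transition table, and rewards are all invented for the example:

```python
gamma = 0.9
states = [0, 1]
actions = [0, 1]

# policy[s][a]: probability of choosing action a in state s (uniform here)
policy = {s: {a: 0.5 for a in actions} for s in states}

# P[(s, a)] = list of (next_state, probability, reward) transitions
P = {
    (0, 0): [(0, 1.0, 0.0)],
    (0, 1): [(1, 1.0, 1.0)],
    (1, 0): [(0, 1.0, 0.0)],
    (1, 1): [(1, 1.0, 2.0)],
}

# Repeatedly apply the expected-value equation until V stops changing
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {
        s: sum(
            policy[s][a]
            * sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
            for a in actions
        )
        for s in states
    }

print(V)
```

Iterating the equation like this is known as policy evaluation: each sweep pulls the values closer to the fixed point that satisfies the equation exactly.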
Deep Q-network – one popular application of this idea is using neural networks to approximate the value functions and policies, rather than storing them in a table.
Q-network – rather than learning the value of whole states, a Q-network learns an action-value function Q(s, a): the expected return from taking action a in state s. The agent can then act by picking the action with the highest estimated Q-value.
Representation of Q-value
The second core component of deep Q-learning is how it represents the value of each action available to the agent in its current state, written Q(s, a). You can think of this as defining what your agent understands about the game it has been set to play.
In classical reinforcement learning, agents use so-called value functions to determine which actions yield the highest return from a given state. Value functions take into account both short-term rewards, like money earned from taking an action now, and longer-term returns, like the eventual payoff of investing in a company that will do well.
However, there’s a problem with this approach. In games with huge numbers of possible states, you cannot store a separate value for every state-action pair, and states the agent has never visited would have no value at all. This is clearly not ideal when we want our AI to make intelligent decisions in new situations.
Luckily, neural networks offer a way around this. In a neural network, neurons are organized into layers, and the output of one layer becomes the input to the next.
By stacking multiple layers, the system can learn complex patterns by combining lower-level features.
The network starts with inputs x1, x2, and so on, and passes them through its layers until it produces an output o. Along the way, the raw input values are combined into progressively higher-level features thanks to the layered organization of the network. In a Q-network, the inputs encode the state, and there is one output per action holding that action’s estimated Q-value.
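To make the layered picture concrete, here is a minimal sketch of such a Q-network in pure Python. The layer sizes, the random weights, and the example state are arbitrary placeholders; a real DQN would use a deep learning framework and trained weights:

```python
import random

random.seed(0)

n_inputs, n_hidden, n_actions = 4, 8, 2

# Randomly initialized weights: W1 maps state features to hidden units,
# W2 maps hidden units to one Q-value per action.
W1 = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_actions)]

def relu(x):
    return x if x > 0 else 0.0

def q_values(state):
    """Forward pass: state features -> hidden layer -> one Q-value per action."""
    hidden = [relu(sum(w * x for w, x in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

state = [0.1, -0.3, 0.8, 0.05]   # the x1, x2, ... inputs from the text
qs = q_values(state)
best_action = max(range(n_actions), key=lambda a: qs[a])
print(qs, best_action)
```

Acting greedily then just means taking `best_action`, the output with the highest estimated Q-value.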
The Bellman equation
A fundamental tool in deep Q-learning is the Bellman equation, named after the mathematician Richard Bellman. It expresses the value of acting in a state recursively: the value of taking an action now equals the immediate reward plus the discounted value of the best action available in the next state. For Q-values it reads:

Q(s, a) = r + γ max_a' Q(s', a')

where s is the current state, a the action taken, r the reward received, s' the resulting state, and γ the discount factor between 0 and 1.
The Bellman equation comes up whenever we solve sequential optimization problems, where the best decision now depends on the options available later. Optimizing your health, for example, means weighing a food’s immediate appeal against its long-term effects; the discount factor captures how much weight you give the future.
This recursive structure is exactly what deep Q-learning exploits: the network’s predicted Q-values are trained to match the right-hand side of the equation. The training itself uses stochastic gradient descent, which nudges the network’s parameters a little on each iteration in the direction that reduces the prediction error. We will talk about gradient descent in more detail in another article.
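Here is what the right-hand side of the Bellman equation looks like as code. The numbers are made up for illustration; the `done` flag handles terminal states, where there is no next state to bootstrap from:

```python
gamma = 0.99  # discount factor

def bellman_target(reward, next_q_values, done):
    """Compute r + gamma * max_a' Q(s', a'); no bootstrap on terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

t = bellman_target(1.0, [0.5, 2.0, -0.3], done=False)
print(t)  # 1.0 + 0.99 * 2.0
```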
The TD update equation
In deep Q-learning, the network’s weights are updated using temporal-difference (TD) learning. The appeal of TD learning is that the agent does not have to wait until the end of an episode to learn: it can update after every single step by comparing what it predicted with what actually happened one step later.
Concretely, before acting the network predicts Q(s, a). After acting, it observes the reward r and the next state s', and forms the TD target r + γ max_a' Q(s', a'). The gap between the target and the prediction is the TD error:

δ = r + γ max_a' Q(s', a') - Q(s, a)

In tabular Q-learning, the update simply nudges the stored value toward the target:

Q(s, a) ← Q(s, a) + α δ

where α is the learning rate. In deep Q-learning, the same idea is implemented with backpropagation: the TD error becomes a loss, and the chain rule is used to adjust every weight in the network a little in the direction that shrinks that loss.
The key difference from ordinary supervised learning is that the target is produced by the network itself, so it keeps moving as the weights change; this is one reason training DQNs is trickier than training a classifier.
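A single tabular TD update can be sketched like this; the states and actions here are toy placeholders:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)], every value starts at 0

def td_update(s, a, r, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = ["left", "right"]
td_update("start", "right", 1.0, "goal", actions)
print(Q[("start", "right")])  # moved 10% of the way toward the target of 1.0
```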
The SARSA update equation
SARSA is a close cousin of Q-learning, and its name spells out the data it learns from: state, action, reward, next state, next action (s, a, r, s', a'). Its update looks almost identical to the TD update above:

Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]

The crucial difference is the target. Q-learning bootstraps from the best possible next action, max_a' Q(s', a'), whether or not the agent actually takes it; SARSA bootstraps from the next action a' the agent really does take under its current policy. This makes Q-learning an off-policy method and SARSA an on-policy one.
As before, using RL in a video game requires defining an action space and a reward function. The action space can be the moves available to the player, such as moving left, right, up, or down, and the reward function scores each outcome. If your character is trying to escape from a horde of monsters, moving away quickly or finding a place to hide might earn a positive reward, while getting caught earns a penalty.
Because SARSA learns the value of the policy it actually follows, exploratory blunders included, it tends to settle on more cautious behavior than Q-learning in hazardous environments like this one.
What we discussed about value functions and policy networks applies here too. In environments where the recent past matters, the network is sometimes made recurrent: a recurrent neural network (RNN) carries information forward across time steps, a bit like remembering earlier parts of a conversation and folding them into what comes next.
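For comparison with the earlier Q-learning update, here is a minimal sketch of one SARSA step, again with invented states and values:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)], every value starts at 0

def sarsa_update(s, a, r, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).
    Unlike Q-learning, the target uses the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q[("s1", "hide")] = 2.0  # pretend this value was learned earlier
v = sarsa_update("s0", "run", 1.0, "s1", "hide")
print(round(v, 3))
```

The only change from the Q-learning sketch is the target line: SARSA looks up the value of the next action actually chosen instead of taking a max over all actions.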