Over the past several years, neural network architectures have advanced significantly, particularly for computer vision applications. Many of these architectures are deep neural networks (DNNs), and the field built around them is known as deep learning. Interest has also surged in so-called “deep reinforcement learning” algorithms, which use DNNs to learn how to perform tasks such as playing games.

Among the many types of DNN architecture, one widely used family for sequence data is the recurrent neural network (RNN). Many RNNs are built from long short-term memory (LSTM) units, which help the network retain information from earlier in a sequence rather than only the most recent item.

There are several reasons why people like using LSTMs. One is that they can pick up patterns that span long sequences. For instance, across the frames of many online videos, the same motion or object may recur again and again. A simpler example is a stream of text in which the letter sequence ‘ABCDEFG’ keeps reappearing: once the pattern has been seen often enough, the next letter can be predicted from the ones before it.

Another advantage of having LSTMs in your model is that they can capture more complex relationships between items in a sequence. This is important because natural languages tend to use complicated structures and long-range dependencies.

## Definition of deep neural networks

What is a deep neural network? A deep neural network (DNN) is a machine learning model that uses layers of artificial neurons to learn complex patterns in data. It has three main components: an input layer, one or more hidden layers, and an output layer. The input layer receives information from the environment or source material, the hidden layers process this information, and the output layer produces the estimated outcome.

The term “neural” comes from a loose analogy with biological neurons. When a neuron receives enough electrical stimulation, it fires and passes a signal on to the neurons it is connected to, and those connections strengthen or weaken over time. Artificial neural networks borrow this idea: many simple nodes, each doing a small computation, work together by passing signals along weighted connections.

In a DNN, each node is one artificial neuron. Most DNN architectures have more than one hidden layer, which is what makes them “deep”: each layer processes the output of the previous one before the network produces its final result.
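The layered structure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production network: the layer sizes, the ReLU activation, the random weights, and all function names are assumptions chosen for the example.

```python
import random

def relu(x):
    """Rectified linear unit: a common hidden-layer activation."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: each neuron is a weighted sum of all inputs."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer with n_in inputs and n_out neurons."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
layer_sizes = [4, 8, 8, 2]  # hypothetical: 4 inputs, two hidden layers of 8, 2 outputs
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

def forward(x):
    """Pass the input through every layer; only hidden layers use the activation."""
    for i, (weights, biases) in enumerate(layers):
        is_output_layer = i == len(layers) - 1
        x = dense(x, weights, biases, activation=None if is_output_layer else relu)
    return x

output = forward([0.5, -1.0, 0.25, 2.0])
print(output)  # two numbers, one per output neuron
```

Real frameworks perform the same computation with matrix operations and learn the weights by gradient descent instead of leaving them random.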

Two widely used families of DNN architecture are convolutional networks and recurrent networks. Recurrent networks have a form of memory: they apply the same computation at every step of a sequence and carry a hidden state forward from one step to the next. Convolutional networks focus on local spatial relationships between parts of an image, so they scan small regions of the picture with shared filters and detect the same feature wherever it appears.
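To make the “small regions” idea concrete, here is a rough one-dimensional sketch of the sliding-window operation that convolutional layers are built on. The signal, the kernel, and the function name `conv1d` are made up for the illustration; real convolutional layers work on 2D images and learn their kernels.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` and take a dot product at each position.

    No padding is used, so the output is shorter than the input. (Strictly
    speaking this is cross-correlation, which is what deep learning
    libraries usually call convolution.)
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # responds where the signal changes between neighbours
response = conv1d(signal, edge_kernel)
print(response)  # [0, 1, 0, 0, -1, 0]
```

The same two-number kernel is reused at every position, which is why it finds the rising edge and the falling edge no matter where they occur.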

## Differences between Q-learning and Q-networks

Two terms come up constantly when neural networks are used to play games: Q-learning and deep Q-networks (DQN). Q-learning is the underlying algorithm and dates back to 1989, while DQN is the neural-network variant introduced by researchers at DeepMind in 2013. They differ in how the learned values are stored.

The key difference is how the values are represented. Classic Q-learning keeps a lookup table with one entry per state–action pair. A DQN replaces that table with a neural network that has an input layer, one or more hidden layers, and an output layer, just like any other feedforward artificial neural network.

In a Q-network, the output layer acts as a reward predictor. The network takes only the current state as input and outputs one scalar value per action, each representing the predicted future reward for taking that action.

These values are called action values because they represent the expected return for taking each possible action in the current situation. For example, if a game offers options A, B, and C and option B has the highest action value, then choosing B is the best choice for maximizing the outcome of the game.
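Picking the action with the highest action value is usually called acting greedily. A tiny sketch, with made-up values for three hypothetical options:

```python
def greedy_action(action_values):
    """Return the action whose estimated value is highest."""
    return max(action_values, key=action_values.get)

# Hypothetical action values for the current state of some game.
action_values = {"A": 0.2, "B": 1.5, "C": -0.3}
best = greedy_action(action_values)
print(best)  # B
```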

## Popular Q-networks

A popular approach to reinforcement learning is the Q-network. A Q-network learns to estimate the expected cumulative reward of each action: not just how likely the action is to achieve the goal, but how much value the agent can expect to collect by taking it and acting well afterwards.

The quantity being estimated is known as the state–action value function, or Q-function (not to be confused with the state-value function, which scores states alone). Q-learning, the algorithm that learns it from experience, was introduced in 1989 by Chris Watkins in his PhD thesis at the University of Cambridge, and the idea was later popularized by Richard Sutton and Andrew Barto’s textbook on reinforcement learning.
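For reference, the state–action value function is usually written as follows. The notation is the standard textbook one rather than anything defined in this article: $\pi$ is the policy being followed, $\gamma \in [0, 1)$ is a discount factor, $r_{t+1}$ is the reward received after step $t$, and $\alpha$ is a learning rate.

$$
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right]
$$

Q-learning improves its estimate after every observed transition $(s, a, r, s')$ with the update

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$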

Since then, many other researchers have built upon these concepts and extended them into more sophisticated algorithms. Some use neural networks instead of lookup tables or hand-coded functions to approximate the state–action value function, and some apply deep learning techniques such as convolutional or recurrent neural nets to handle high-dimensional inputs like images.

But all of these approaches share one big weakness: they can get stuck in local optima. This comes from the exploration–exploitation trade-off. An agent that always exploits the action it currently believes is best may settle on a mediocre policy and keep following it because it works well enough for now, never exploring the alternatives that would reveal a better one.

## When to use Q-learning or Q-networks

Depending on the problem you are trying to solve, either of these two approaches can be used.

Tabular Q-learning is typically used when the state and action spaces are small and discrete enough to store one value per state–action pair. For example, teaching a robot to push a box across a small grid, with a high reward for successfully moving the box to its target.

Q-networks are the better fit when the state space is too large or continuous for a table, so the agent has to generalize from states it has seen to states it has not. An example of this would be teaching an autonomous vehicle to navigate around other vehicles from camera or sensor input.

## Tips for choosing between Q-learning and Q-networks

When it comes to optimizing online learning algorithms, there are two main strategies: direct optimization and indirect optimization. Direct optimization means changing the algorithm’s parameters or policy directly, trying many variations, and keeping the ones that perform best.

Indirect optimization uses additional information to help make better decisions. For example, instead of maximizing reward directly, an algorithm may learn value estimates from past experiences and use them as clues about which decisions are good. Q-learning and Q-networks both work this way.
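One common way for a Q-learning agent to reuse past experiences is an experience replay buffer: transitions are stored as they happen and later sampled at random for training. The sketch below is a minimal version under that assumption; the class name, the tuple layout, and the capacity are illustrative choices, not part of any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return up to `batch_size` distinct transitions chosen uniformly at random."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(2)
print(len(buffer))  # 3, because only the three most recent transitions are kept
```

Sampling at random breaks up the correlation between consecutive steps, which tends to make training a value estimator more stable.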

Deep reinforcement learning (DRL) uses an internal representation of states and actions that evolves over time during training. The neural network learns these representations itself rather than relying on hand-designed features, which allows it to improve with experience.

The key difference between direct optimization and indirect optimization is where the emphasis lies. With indirect optimization, the system looks at examples from the environment and extracts patterns to inform future behavior.

Directly optimizing an equation does not require any background knowledge beyond the math itself, but finding the optimal solution depends on having enough data to evaluate potential solutions. In other words, you need lots and lots of trials before anything works.

## What is the future of Q-learning and Q-networks?

Recent developments in reinforcement learning have not changed the goal, which is still to maximize the overall (discounted) reward rather than the reward at any single time step. What has changed is how the value function is represented: deep Q-learning replaces the lookup table with a neural network that learns to approximate the action values.
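In deep Q-learning, the network is commonly trained toward a target built from the observed reward plus the discounted value of the best next action. A minimal sketch of that target, with hypothetical numbers (the function name and default discount are assumptions):

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: reward plus discounted best next action value.

    If the episode ended on this step there is no next state, so the
    target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical transition: reward 1.0, and the network's estimates for the
# three actions available in the next state.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0], gamma=0.9)
print(round(target, 6))  # 2.8
```

The network’s prediction for the action actually taken is then nudged toward this target, typically by minimizing a squared or Huber loss.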

A key component of this newer method is what’s been coined a “deep Q-network.” A deep Q-network learns to perform a task even when rewards are delayed. For instance, an action taken now may earn nothing immediately and only pay off several steps later, and the network learns to credit the earlier action for that later reward.

Combining neural networks with value-based learning is not new. In the early 1990s, Gerald Tesauro’s TD-Gammon learned to play backgammon at a strong level using a neural network trained with temporal-difference learning, and Long-Ji Lin introduced experience replay for neural-network-based Q-learning.

For a long time, however, training such networks was unstable. That changed when Mnih et al. at DeepMind introduced DQN in 2013 and published an extended version in Nature in 2015, showing that a single architecture could learn many Atari games directly from pixels.

Since then, refinements such as Double DQN, dueling networks, prioritized experience replay, and Rainbow have improved stability and sample efficiency. Applying these methods reliably outside of games and simulators remains an open challenge.

## Definition of Q-learning

What is Q-learning? Simply put, it’s an algorithm that learns, for each situation (state), how good each available action is at getting the agent to its goal, and then uses those estimates to reach the goal as efficiently as possible.

The key part here is “as efficiently as possible.” There are usually many ways to reach a goal, and therefore many different strategies the agent could follow.

With that said, the term “optimize” isn’t quite right at the start of training. The word implies that the agent already knows how to reach its target. But what if it has to keep trying actions until something works?

That is trial and error, and Q-learning builds it in. A common rule is “take a random action with probability ε, otherwise take the action currently believed to be best.” For example, with ε = 0.1 the agent explores at random 10% of the time and exploits its best-known action the other 90%.
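The idea of choosing actions with fixed probabilities is commonly implemented as an epsilon-greedy rule. The sketch below assumes the action values are held in a plain list indexed by action; the values and the choice of ε = 0.1 are made up for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q_values = [0.1, 0.9, 0.4]  # hypothetical estimates for three actions
choices = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
share_greedy = choices.count(1) / len(choices)
print(share_greedy)  # most picks are action 1, the one with the highest estimate
```

In practice ε is often started high and decayed over training, so the agent explores a lot early on and exploits more as its estimates improve.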

This may sound inefficient, but it makes sense when you think about how people learn. You wouldn’t keep putting your hand into hot water after learning that heat causes pain, nor would you spend all day picking up objects only to drop them and start over. After enough repetitions, the behaviors that work are reinforced and become automatic, efficient routines.

In the same way, intelligent agents using reinforcement learning learn strategic decisions through repetition.
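Learning through repetition can be seen in a very small tabular Q-learning example. Everything about the task here is an assumption made for illustration: a five-state corridor where the agent starts at the left end and is rewarded only on reaching the right end, along with the chosen learning rate, discount, and exploration rate.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode (toy task)
ACTIONS = [-1, +1]  # action 0 moves left, action 1 moves right

def step(state, action):
    """Move along the corridor; reward 1.0 only on reaching the last state."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(q[state])       # exploit, breaking ties at random
                action = rng.choice([a for a in range(2) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward the one-step target.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # [1, 1, 1, 1]: move right in every non-terminal state
```

Early episodes are close to a random walk, but once the reward has been found its value propagates backwards through the table one state at a time.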

## Definition of Q-networks

In reinforcement learning, you create agents that learn how to perform tasks by interacting with their environment. A task is defined as anything that requires action for completion, like playing chess or doing something more complicated like driving.

The agent learns through interaction with the environment. At each step, it observes the current state, chooses an action, and receives a reward as feedback, which can be positive or negative. Rewards are numerical: for example, +100 for winning a game or -1 for each wasted move.
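The interaction loop itself is short. The sketch below uses a made-up coin-guessing environment with `reset` and `step` methods; the class, its reward scheme, and the fixed policy are all hypothetical and only meant to show the shape of the loop.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the step counter

    def step(self, action):
        coin = self.rng.randrange(2)
        reward = 1.0 if action == coin else -1.0  # positive or negative feedback
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done

def run_episode(env, policy):
    """Run one episode: observe state, choose action, receive reward, repeat."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

env = CoinFlipEnv(horizon=10, seed=0)
episode_return = run_episode(env, policy=lambda state: 0)  # always guess 0
print(episode_return)
```

A learning agent would replace the fixed `policy` with one that is updated from the rewards it receives.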

Many strategies in RL are built on what’s called a policy gradient. These methods adjust the parameters of a policy in the direction that increases the expected return, so that actions that led to high returns in past experience become more likely.

Another family of methods relies on value function approximation. A value function estimates the worth of each state (or state–action pair) as the expected sum of future rewards from that point on. When the state space is too large to store every value, the function is approximated, for example with a neural network.
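A state’s value is usually defined in terms of the discounted sum of the rewards that follow it, and a value function estimates the expectation of that sum. The helper below computes the sum for one observed reward sequence; the rewards and the discount factor are made up for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps in the future is weighted by gamma**k."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g  # fold in later rewards, discounted one more step
    return g

g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

Averaging this quantity over many episodes that pass through the same state gives a simple Monte Carlo estimate of that state’s value.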

This article focuses on one such approximator, the Q-network. Q-networks were popularized by Mnih et al. at DeepMind in 2013 and have been widely used since. They differ slightly from standard feedforward classifiers in that no sigmoid or softmax activation is applied at the output layer: Q-values are unbounded real numbers, so the final layer is linear.