Attention is one of the most important concepts in deep learning. It has many applications, from image classification to natural language processing (NLP). In NLP, attention can be applied at different levels: to characters, to words, or to whole sentences.
In this article we will focus on applying it at the word level. This type of attention is known as self-attention (or intra-attention), and it lets every word in a sentence weigh its relationship to every other word. For instance, given an input string like “The dog was hungry so it ate all the food”, the model can learn to pay special attention to the link between ‘it’ and ‘the dog’, resolving what the pronoun refers to, and then use that information when producing its output.
This article will go into more depth about how to implement self-attention in PyTorch. We will also take a look at some possible uses and variations of the technique.
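To make this concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The random tensors stand in for learned word embeddings, and the dimensions are arbitrary illustrative choices, not values from any particular model.

```python
import torch
import torch.nn.functional as F

# Toy example: a "sentence" of 9 word embeddings, each 16-dimensional.
# Random values stand in for learned embeddings.
torch.manual_seed(0)
seq_len, d_model = 9, 16
x = torch.randn(seq_len, d_model)

# Learned projections for queries, keys, and values.
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention: every word scores every other word.
scores = q @ k.T / d_model ** 0.5      # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)    # each row sums to 1
output = weights @ v                   # context-aware word representations

print(weights.shape, output.shape)     # torch.Size([9, 9]) torch.Size([9, 16])
```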
What are neural networks?
Neural networks are computational structures that take input data and apply mathematical operations to it to produce output data. In other words, they are systems of rules or functions that process information.
Specifically, feed-forward artificial neural networks (ANNs) are programs that learn to perform specific tasks by being fed large numbers of examples. The examples are referred to as inputs or stimuli, and the desired results are called outputs.
Rather than being told explicit rules, an ANN infers the relevant patterns from the examples themselves, which makes it better at handling similar but unseen inputs than a program whose behavior has to be specified by hand.
Because of this efficiency, neural networks have become ubiquitous across industries, from predicting disease symptoms to finding new drugs.
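As a first concrete picture, here is a minimal feed-forward network in PyTorch; the layer sizes are arbitrary placeholders chosen just to show the structure.

```python
import torch
import torch.nn as nn

# A minimal feed-forward network: data flows in one direction,
# from inputs through a hidden layer to outputs, with no feedback loops.
# The layer sizes (4 -> 8 -> 3) are arbitrary placeholders.
model = nn.Sequential(
    nn.Linear(4, 8),   # 4 input features -> 8 hidden units
    nn.ReLU(),         # non-linearity between layers
    nn.Linear(8, 3),   # 8 hidden units -> 3 output values
)

x = torch.randn(2, 4)    # a batch of 2 examples with 4 features each
print(model(x).shape)    # torch.Size([2, 3])
```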
What is an artificial neural network?
Artificial Neural Networks (ANNs) are computational models inspired by how neurons work in our brains. We take in the world through many senses: we perceive sound as pressure waves, sight as light, touch as pressure and temperature. Our brains integrate these separate signals into the single coherent experience we call reality.
Our sense of taste works the same way. For example, if you put sugar in water and drink it, your brain combines the chemical signal from your tongue with everything else you sense and concludes that it tastes sweet!
Artificial neural networks follow a similar idea. Rather than treating each input feature in isolation, they use mathematical functions to combine all of the inputs into a single internal representation, one that can be thought of as capturing the key features of whatever is being sensed.
In computer science, such functions with internal weights and biases are referred to as layers. By stacking multiple layers, ANNs get very deep; the term “deep learning” simply refers to networks built from many stacked layers.
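Here is a sketch of what one such layer computes by hand, with made-up weight and bias values; `nn.Linear` in PyTorch packages exactly this weighted-sum-plus-bias operation.

```python
import torch

# One "layer" by hand: a weighted sum of inputs plus a bias,
# followed by a simple activation (more on activations below).
x = torch.tensor([1.0, 2.0, 3.0])      # inputs from several "senses"
w = torch.tensor([0.4, -0.2, 0.1])     # learned weights (placeholder values)
b = torch.tensor(0.05)                 # learned bias (placeholder value)

z = w @ x + b                          # combine all inputs into one value
a = torch.relu(z)                      # activation applied to the result
print(z.item(), a.item())              # 0.35 0.35
```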
Now that we have a little bit of context, let’s take a look at how to train an ANN to recognize patterns.
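Below is a minimal sketch of such a training loop in PyTorch, on synthetic data with illustrative hyperparameters. The loss function and optimizer used here are covered in the sections below; this is meant to show the shape of the process, not a tuned recipe.

```python
import torch
import torch.nn as nn

# Synthetic data: 100 examples with 4 features and toy binary labels.
torch.manual_seed(0)
X = torch.randn(100, 4)
y = (X.sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # how wrong are the current predictions?
    loss.backward()               # compute gradients
    optimizer.step()              # nudge the weights downhill

print(f"final loss: {loss.item():.3f}")
```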
What are activation functions?
An activation function is a mathematical formula applied to the output of a node or layer in a neural network. Its main purpose is to introduce non-linearity: without one, a stack of layers would collapse into a single linear transformation, no matter how deep the network.
The most common classic activation function is the sigmoid, also known as the logistic function. It squashes any input into a value strictly between 0 and 1, which is why its output is often read as a probability of true (1) versus false (0).
By taking a linear combination of the input values and passing it through the sigmoid, σ(z) = 1 / (1 + e^(-z)), we get a number between 0 and 1. This number represents how likely it is that the corresponding bit in the outcome is true: a higher number means the bit is more probably true, a lower number means it is less likely.
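A quick sketch of that squashing behavior, using PyTorch’s built-in `torch.sigmoid` on a few illustrative input values:

```python
import torch

# The sigmoid squashes any real number into (0, 1); the output can be
# read as the probability that the corresponding label is "true".
z = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
print(torch.sigmoid(z))
# tensor([0.0180, 0.2689, 0.5000, 0.7311, 0.9820])
```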
There are many reasons why choosing appropriate activation functions can help your model work better. One important role an activation function plays is shaping the gradients that flow backward through the network: saturating functions like the sigmoid produce near-zero gradients for large inputs, leaving parts of the net barely changing from one training iteration to the next.
Another reason activations play a big part in deep nets is convergence speed: non-saturating functions such as ReLU typically allow much faster training than sigmoids.
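A small experiment, on illustrative input values, hints at why: sigmoid gradients shrink toward zero for large inputs, while ReLU passes gradients through unchanged for positive inputs.

```python
import torch

# Compare gradients of sigmoid and ReLU at a few points.
z = torch.tensor([-4.0, 0.5, 4.0], requires_grad=True)

torch.sigmoid(z).sum().backward()
print(z.grad)    # tensor([0.0177, 0.2350, 0.0177]): tiny at the extremes

z.grad = None    # reset before the second comparison
torch.relu(z).sum().backward()
print(z.grad)    # tensor([0., 1., 1.]): no shrinking for active units
```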
What is cross-entropy?
Cross-entropy is a loss function (error measure) that is very important for optimizing neural networks, especially classifiers. It’s what quantifies how well your model is performing on your training data!
The smaller this value, the better your model is doing its job. The larger the number, the poorer the performance of the model!
In fact, one popular way to compare classifiers is to compare their cross-entropy on the same held-out data: the model with the lower cross-entropy assigns higher probability to the correct answers, and is therefore generally the more accurate of the two. (This is distinct from the AUROC, the area under the receiver operating characteristic curve, which summarizes the trade-off between true and false positive rates rather than predicted probabilities.)
Cross-entropy is reduced by training itself: gradient descent adjusts the weights so that the predicted probabilities move closer to the true labels. Regularization techniques such as dropout, weight decay, or simply gathering more data help ensure the reduction reflects real learning rather than overfitting.
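Here is a small sketch of cross-entropy in PyTorch using `nn.CrossEntropyLoss`, with made-up logits for a three-class problem:

```python
import torch
import torch.nn as nn

# Cross-entropy compares predicted class scores (logits) with true labels.
# Lower is better: confident, correct predictions give a small loss.
loss_fn = nn.CrossEntropyLoss()

logits = torch.tensor([[2.0, 0.1, -1.0]])   # raw scores for 3 classes
target = torch.tensor([0])                  # the true class is class 0

print(loss_fn(logits, target))       # ~0.18: the model favors the right class

bad_logits = torch.tensor([[-1.0, 0.1, 2.0]])
print(loss_fn(bad_logits, target))   # ~3.18: the model favors the wrong class
```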
What are loss functions?
A loss function is an equation that measures how far your model’s predictions are from the targets you wanted; training a model means adjusting its parameters to make this number as small as possible.
In deep neural networks, one of the most important choices is the _architecture_: the layers and their settings. The architecture determines which features of an image your network can use to recognize the picture.
By tweaking the architecture’s parameters and settings, you can push your network toward different features of the images, such as lines, curves, and shapes, rather than letting it memorize whole people, animals, or landscapes.
By doing this, the algorithm learns more about shape and pattern recognition. Some examples of settings you can change include the weights, dropout rates, and the number of layers.
A small sketch below shows how two of these, depth and dropout rate, appear as tunable knobs in code.
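In this sketch the function name `make_net` and the specific sizes are hypothetical choices for illustration, not settings from any particular model.

```python
import torch.nn as nn

def make_net(n_hidden_layers: int, dropout: float) -> nn.Sequential:
    """Build a small classifier; depth and dropout rate are tunable."""
    layers = [nn.Linear(4, 16), nn.ReLU(), nn.Dropout(dropout)]
    for _ in range(n_hidden_layers - 1):
        layers += [nn.Linear(16, 16), nn.ReLU(), nn.Dropout(dropout)]
    layers.append(nn.Linear(16, 3))
    return nn.Sequential(*layers)

# Two candidate configurations; the specific values are placeholders.
shallow = make_net(n_hidden_layers=1, dropout=0.2)
deep = make_net(n_hidden_layers=3, dropout=0.5)
```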
What are gradient descent and stochastic gradient descent?
Gradient descent is an optimization algorithm that has become central to machine learning. The method itself is old, usually credited to Augustin-Louis Cauchy, who described it in 1847!
Its importance for neural networks arrived much later, in 1986, when Rumelhart, Hinton, and Williams popularized backpropagation, an efficient way to compute the gradients that gradient descent needs.
That combination made it practical to train non-linear models, from logistic regression all the way up to multi-layer neural networks, using gradient descent.
Gradient descent works by taking small steps in the parameter space of the model. The parameters are the numbers you can tweak to change how well the model performs its job.
At each step, the algorithm computes the gradient of the loss and moves the parameters a small distance in the direction of lower cost on the training data. Stochastic gradient descent (SGD) is the same idea, except each step estimates the gradient from a small random batch of examples rather than the full dataset, which is far cheaper per step and often converges faster. When enough steps have been taken, training stops, and the resulting parameters are evaluated on held-out test data.
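Here is a minimal sketch of plain gradient descent on a one-parameter toy problem, f(w) = (w - 3)^2, whose minimum is at w = 3:

```python
import torch

# Gradient descent by hand on f(w) = (w - 3)^2.
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1                        # step size (learning rate)

for step in range(50):
    loss = (w - 3) ** 2
    loss.backward()             # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad        # step in the direction of lower cost
    w.grad.zero_()              # clear the gradient for the next step

print(w.item())                 # close to 3.0, the minimum
```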
The key here is that these trained models will perform well for tasks similar to those used to create the model. In other words, if you give it new examples like those used to train the model, then it should do just fine!
This property is known as generalization: the model has learned concepts from its training data that carry over to new, unseen examples of the same kind. (Transfer learning is a related but different idea, in which a model trained on one task is adapted and applied to a new one.)
What are some applications of deep learning?
A key focus of research in computer science for several years now is what are known as “deep neural networks” or DNNs. The first mathematical model of a neuron was proposed in 1943 by neurophysiologist Warren McCulloch and logician Walter Pitts, but such networks saw little practical use until backpropagation, which adjusts the strengths of connections (weights) in the network according to how well it predicts outcomes, was popularized in 1986.
In their paper, McCulloch and Pitts noted that neurons in our nervous systems connect with one another to perform functions like recognizing patterns and shapes, and suggested that machines might contain similar structures, which is where the term “neural network” comes from.
But while there have been attempts at building artificial neural networks ever since, it wasn’t until the 1990s that networks of modest depth worked efficiently, and it took until the late 2000s and early 2010s, with faster hardware and far larger datasets, before truly deep networks worked well.
Nowadays, however, almost every major company in the technology industry has experimented with applying DNNs to solving their computational problems. Some even release software packages or APIs built around specific architectures of DNNs to allow others to apply them directly.
Given how popular DNNs have become, many educational resources exist to help beginners learn about them.
What are some disadvantages?
One major disadvantage of deep learning is that it can be difficult to train.
Deep neural networks require a large amount of data in order to work properly, which may not always be available or feasible to collect. This means you will need to spend time gathering data, labeling it, and finding examples and instances of what you want the network to learn.
In addition, when there is no clear pattern for how different categories relate to each other, then it becomes hard to teach the network what those patterns are.
This is why people use deep learning for tasks such as speech recognition and image classification: domains where large, well-labeled collections of examples already exist, because humans have recorded vast amounts of speech and organized vast numbers of images.
There are ways to mitigate this problem, such as using hand-engineered feature detectors with simpler classifiers instead of learning everything end to end, but these may not perform as well because they do not take advantage of all of the structure present in the data.
Another downfall of deep learning is that it often requires a great deal of computational power and memory, since training means repeatedly pushing large amounts of data through models with millions of parameters.
Since most applications of AI today rely heavily on computer vision and natural language processing, this can be a limiting factor.