What Is The Attention Mechanism In Deep Learning?

May 24, 2022 4:07:07 PM

Less than two decades ago, Deep Learning (DL), which uses multi-layered artificial neural networks loosely inspired by the human brain, was largely confined to theory. Today it is used to solve real-world problems such as converting speech to text transcripts and powering various computer vision applications. One of the key mechanisms behind many of these applications is known as the Attention Mechanism or Attention Model.

DL is a subset of Machine Learning (ML), which is itself a subset of Artificial Intelligence (AI). Deep neural networks use the attention mechanism for Natural Language Processing (NLP) problems such as summarization, comprehension, and story completion. Through years of research and experimentation in DL, much of the text analytics involved in these problems has been automated with the help of the attention mechanism.

Before going any further, we need to understand what attention is and how it works in DL, along with a high-level view of related concepts such as the encoder and the decoder. This article explains these concepts and shows how the attention mechanism is applied in real-world applications.

What Is Attention In Deep Learning?

The concept of attention in DL emerged from the applications of NLP coupled with machine translation. Alex Graves, a lead AI research scientist at DeepMind, the renowned AI research collective, indirectly defined attention. According to his lecture at DeepMind in 2020,

Attention is memory per unit of time.

So attention arose from real-world AI problems that involve time-varying data. In machine learning terms, such collections of data are known as sequences. The earliest model from which the attention mechanism derives its concepts is the Sequence to Sequence (Seq2Seq) learning model.

How Does The Attention Mechanism Work? 

Attention is one of the most researched concepts in the domain of deep learning for problems such as neural machine translation and image captioning. There are certain supporting concepts that help better explain the attention mechanism idea as a whole, such as Seq2Seq models, encoders, decoders, hidden states, context vectors, and so on.

When defining attention in simple terms, it refers to focusing on a certain component of the input problem and taking greater notice of it. The DL-based attention mechanism is also based on directing your focus, and paying greater attention to specific factors of a problem when processing data relevant to the problem.

Let us consider a sentence in English: "I hope you are doing well".

Our goal is to translate this sentence into Spanish. So, while the input sequence is the English sentence, the output sequence is supposed to be "Espero que estés bien".

For each word in the output sequence, the attention mechanism identifies the relevant words in the input sequence. So, "Espero" in the output sequence will be mapped to "I hope" in the input sequence.

Higher weights, which act as relevance scores, are assigned to the input words that matter most for the output word being produced. This improves the accuracy of the output prediction because the model can focus on the relevant parts of the input.
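To make the idea of weights concrete, here is a minimal sketch in Python. The relevance scores are made-up numbers chosen for illustration, not the output of a trained model; a real attention model would compute them from learned parameters. The sketch only shows how raw scores are normalized into weights that sum to 1.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

source_words = ["I", "hope", "you", "are", "doing", "well"]

# Hypothetical relevance scores of each English word for the output word
# "Espero". These values are invented for illustration only.
scores_for_espero = [2.0, 3.0, 0.1, 0.1, 0.2, 0.3]

weights = softmax(scores_for_espero)
for word, weight in zip(source_words, weights):
    print(f"{word:>6}: {weight:.2f}")
```

With these assumed scores, "I" and "hope" receive most of the weight, which mirrors the mapping of "Espero" to "I hope" described above.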

Attention In Sequence To Sequence Models

The Seq2Seq learning model is what gave rise to the attention mechanism. More precisely, attention was introduced to resolve the main issue with Seq2Seq models. To begin with, Seq2Seq models use the encoder-decoder architecture to solve a problem, be it translating a sentence or identifying the elements of an image.

Long input sequences and images with more than one element are often difficult for these models to process accurately. In the encoder, each element of the input sequence is turned into a hidden state, which is passed forward and combined with the next element. During decoding, only the last hidden state is used as the context for generating the output sequence, so everything the decoder knows about the input has to be compressed into that single state.
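The limitation can be sketched with a toy encoder. This is a deliberately simplified illustration: the hidden state is a single number, and the function names, fixed weights (`w_hidden`, `w_input`), and input values are assumptions made for the example rather than part of any particular Seq2Seq implementation.

```python
import math

def rnn_step(prev_hidden, x, w_hidden=0.5, w_input=1.0):
    """One step of a toy scalar RNN: mix the previous state with the new input."""
    return math.tanh(w_hidden * prev_hidden + w_input * x)

def encode(inputs):
    """Return every hidden state produced while reading the inputs in order."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Hypothetical one-number "embeddings" for a six-word sentence.
inputs = [0.9, -0.4, 0.7, 0.1, -0.8, 0.3]
hidden_states = encode(inputs)

# Without attention, the decoder is handed only this single value as context.
context = hidden_states[-1]
print(len(hidden_states), "hidden states, but the decoder sees only", round(context, 3))
```

Six hidden states are computed, yet only the last one reaches the decoder, which is why information from early words tends to fade in long sequences.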

With an attention model, the hidden states of the input sequence are all retained and utilized during the decoding process. A unique mapping is created between each time step of the decoder output and the encoder input. Each element of the output sequence coming out of the decoder has access to the entire input sequence to select the appropriate elements for the output.
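As a rough sketch of one decoding step with attention, the snippet below scores every encoder hidden state against the current decoder state, normalizes the scores, and builds a context vector as the weighted sum of all hidden states. The dot product is used as the scoring function purely for simplicity, and the vectors are hypothetical values, not the output of a trained model.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # The context vector is the weighted sum of all encoder hidden states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(size)
    ]
    return weights, context

# Hypothetical 2-dimensional hidden states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Because the context is recomputed at every decoder step, each output element can draw on a different mix of input positions instead of a single fixed summary.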

Types Of Attention Mechanisms

Attention mechanisms differ based on where the particular attention mechanism or model finds its application. Another distinction is the areas or relevant parts of the input sequence where the model focuses and places its attention. The following are the types:

1)Generalized Attention

When a sequence of words or an image is fed to a generalized attention model, it examines each element of the input sequence and compares it against the element of the output sequence being generated. So, each iteration involves comparing the encoder's representation of the input sequence with an element of the decoder's sequence. From the resulting comparison scores, the mechanism then selects the words or parts of the image that it needs to pay attention to.


2)Self-Attention

The self-attention mechanism is also sometimes referred to as the intra-attention mechanism. It is so called because it relates different positions within a single sequence to one another in order to compute a representation of that same sequence. It does not compare the input against a separate output sequence: the elements being scored and the elements doing the scoring all come from the input itself.
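A minimal sketch of self-attention follows, using the scaled dot-product form. As a simplifying assumption, the queries, keys, and values are all taken to be the input vectors themselves; practical models usually apply learned projections first, which are omitted here. The token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(sequence):
    """Scaled dot-product self-attention with queries = keys = values = input."""
    scale = math.sqrt(len(sequence[0]))
    outputs = []
    for query in sequence:
        # Every position is scored against every position of the same sequence.
        weights = softmax([dot(query, key) / scale for key in sequence])
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, sequence))
            for i in range(len(query))
        ])
    return outputs

# Hypothetical 2-dimensional vectors for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(tokens)
for token, out in zip(tokens, outputs):
    print(token, "->", [round(x, 3) for x in out])
```

Each output vector is a blend of all the input vectors, weighted by how similar they are to the position being updated.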

3)Multi-Head Attention

Multi-head attention is the form of attention used in the Transformer model. Instead of computing attention once, the attention module runs several attention computations in parallel, and each of these parallel computations is known as an attention head. Each head processes the input sequence and the corresponding output sequence element independently, using its own learned projections. The outputs of all heads are then combined, typically by concatenation followed by a linear projection, so that different nuances of the input sequence are taken into consideration.
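The following sketch illustrates the idea of parallel heads under strong simplifying assumptions: each head simply attends over its own slice of the input vectors, and the head outputs are concatenated. The learned per-head projections and the final output projection used in real Transformer layers are left out, and the function names and token values are illustrative only.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        weights = softmax([dot(query, key) / scale for key in keys])
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

def multi_head_self_attention(sequence, num_heads):
    """Run one attention computation per head on a slice of each vector."""
    dim = len(sequence[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for head in range(num_heads):
        start, end = head * head_dim, (head + 1) * head_dim
        sliced = [token[start:end] for token in sequence]
        head_outputs.append(attention(sliced, sliced, sliced))
    # Concatenate the outputs of all heads for each position.
    return [
        [x for head in head_outputs for x in head[position]]
        for position in range(len(sequence))
    ]

# Hypothetical 4-dimensional vectors for a three-token sequence.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head_self_attention(tokens, num_heads=2)
for row in combined:
    print([round(x, 3) for x in row])
```

Since each head sees a different part of the representation, the heads can end up with different attention patterns over the same sequence.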

4)Additive Attention

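This type of attention, also known as the Bahdanau attention mechanism, computes an alignment score for each pair of source and target positions using a small feed-forward network. The decoder's hidden state and an encoder hidden state are each passed through a learned linear transformation, the results are added together (hence "additive"), and the sum goes through a tanh non-linearity before being reduced to a single score. Source or input sequence words are thus correlated with target or output sequence words softly rather than through an exact one-to-one match. The scores over all encoder hidden states are normalized with a softmax, and the context vector is the weighted sum of those hidden states.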
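The additive scoring function is commonly written as score(s, h) = v · tanh(W_s s + W_h h). The sketch below implements that formula directly. The matrices `W_s` and `W_h` and the vector `v` would be learned during training; the values used here are arbitrary placeholders, as are the hidden states.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(decoder_state, encoder_state, W_s, W_h, v):
    """Additive alignment score: v . tanh(W_s s + W_h h)."""
    projected = [
        a + b
        for a, b in zip(matvec(W_s, decoder_state), matvec(W_h, encoder_state))
    ]
    return sum(vi * math.tanh(p) for vi, p in zip(v, projected))

# Placeholder parameters; in a real model these are learned.
W_s = [[0.5, -0.2], [0.1, 0.4]]
W_h = [[0.3, 0.8], [-0.6, 0.2]]
v = [1.0, -1.0]

# Hypothetical hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.2, 0.7]

scores = [additive_score(decoder_state, h, W_s, W_h, v) for h in encoder_states]
weights = softmax(scores)
context = [
    sum(w * h[i] for w, h in zip(weights, encoder_states))
    for i in range(len(encoder_states[0]))
]
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```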

5)Global Attention

This type of attention mechanism is also referred to as the Luong mechanism. It is a multiplicative attention model, in which alignment scores are computed from products of the decoder and encoder hidden states, and it was proposed as a refinement of the Bahdanau model. In neural machine translation, the Luong model can either attend to all source words (global attention) or predict an aligned position in the source sentence and attend only to a smaller window of words around it (local attention). While both the global and local attention models are viable, the context vectors used in each method differ, because they are built from different sets of source hidden states.
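The difference between global and local attention can be sketched as follows. This is a simplified illustration: the dot product serves as the multiplicative score, and for the local variant the aligned position (`center`) and window size (`half_width`) are passed in by hand, whereas the original formulation predicts the position and may also reweight the window. All names and values are assumptions made for the example.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_weights(decoder_state, encoder_states):
    """Attend to every source position."""
    return softmax([dot(decoder_state, h) for h in encoder_states])

def local_weights(decoder_state, encoder_states, center, half_width):
    """Attend only to a window of source positions around `center`."""
    start = max(0, center - half_width)
    end = min(len(encoder_states), center + half_width + 1)
    window = softmax([dot(decoder_state, h) for h in encoder_states[start:end]])
    weights = [0.0] * len(encoder_states)
    weights[start:end] = window  # positions outside the window stay at zero
    return weights

# Hypothetical hidden states for a five-word source sentence.
encoder_states = [[1.0, 0.0], [0.8, 0.3], [0.2, 0.9], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.4, 1.2]

g = global_weights(decoder_state, encoder_states)
l = local_weights(decoder_state, encoder_states, center=2, half_width=1)
print("global:", [round(w, 3) for w in g])
print("local: ", [round(w, 3) for w in l])
```

The global weights are spread over all five positions, while the local weights are non-zero only for positions 1 to 3, so the two variants produce different context vectors from the same hidden states.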

Attention Mechanism Helps In Automating Deep Learning Applications

The attention mechanism has been instrumental in solving real-world problems related to the automated processing of speech, images, and text. The automation of computer vision and NLP applications using the attention mechanism has had a widespread and long-lasting impact in industries such as legal advisory, education, streaming, logistics, supply chain, finance, and more. If you are looking for AI and ML solutions to automate your digital workflows, book a free consultation with us today.

Written by Allen Victor

Writes content around viral technologies and strives to make them accessible for the layman. Follow his simplistic thought pieces that focus on software solutions for industry-specific pressure points.