Attention Mechanism in Deep Learning

The introduction of the Attention Mechanism has revolutionized the way we work with deep learning models. It is one of the most valuable developments behind many recent advancements in Natural Language Processing, such as the Transformer model and Google’s BERT. In this blog, we will explore the concepts behind Attention, its main types, and its applications in Transformers.

What is Attention?

Attention generally refers to the process of selectively focusing on a specific thing or topic while ignoring everything else. The Attention Mechanism in Deep Learning is based on a similar concept: the network selectively focuses on certain factors while processing data and ignores the rest. Within the architecture, it is the component that quantifies the interdependence between the input and output elements, and among the input elements themselves.
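To make the idea concrete, below is a minimal sketch in NumPy (with illustrative names, not any particular paper's formulation) of attention as a similarity-weighted sum: each input element gets a score against a query, the scores are softmaxed into weights, and the result is the weighted sum of the inputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, inputs):
    scores = inputs @ query        # similarity of the query to each input element
    weights = softmax(scores)      # attention weights: non-negative, sum to 1
    context = weights @ inputs     # weighted sum of the inputs
    return context, weights

inputs = np.random.randn(5, 8)     # 5 input elements, each of dimension 8
query = np.random.randn(8)
context, weights = attention(query, inputs)
print(weights)                     # larger weight = more focus on that element
```

Elements with higher weights dominate the context vector, which is exactly the “selective focus” described above.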

Why is Attention better than the standard sequence-to-sequence model?

The main drawback of seq2seq models is their inability to process long input sequences accurately, because only the last state of the encoder RNN is available to the decoder as the context vector. The Attention mechanism was introduced to overcome this problem: during decoding, it retains all the hidden states of the encoder RNN and maps each decoder output to all the hidden states of the input sequence.
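The contrast can be sketched in a few lines of NumPy (dot-product scoring and the variable names here are illustrative assumptions, not taken from any specific paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

encoder_states = np.random.randn(50, 16)   # 50 input time steps, hidden size 16
decoder_state = np.random.randn(16)        # current decoder hidden state

# Plain seq2seq: the context is only the final encoder state, no matter how
# long the input sequence is, so early information must survive 50 steps.
context_seq2seq = encoder_states[-1]

# With attention: every encoder hidden state contributes, weighted by its
# relevance to the current decoder state.
scores = encoder_states @ decoder_state    # one score per input time step
weights = softmax(scores)                  # normalize into attention weights
context_attention = weights @ encoder_states
```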

Types of Attention Models

Attention models can be categorized into two major types: Bahdanau Attention and Luong Attention. The major differences between them lie in their computations and architecture, while the underlying principle remains the same.

Bahdanau Attention

This model is also called Additive Attention and was proposed by Dzmitry Bahdanau in a paper aimed at improving the seq2seq model for machine translation tasks. It attempts to align the decoder with the relevant parts of the input sentence and then applies the Attention mechanism.

Here’s how the attention mechanism was implemented in Bahdanau’s paper:

  1. The encoder creates hidden states for each element of the input sequence
  2. Alignment scores are calculated between each of the encoder’s hidden states and the previous decoder hidden state
  3. The alignment scores for all encoder hidden states are combined into a single vector, which is then passed through a softmax
  4. A context vector is computed as the weighted sum of the encoder hidden states, using the softmaxed alignment scores as weights
  5. The context vector is concatenated with the previous decoder output and fed, along with the previous decoder hidden state, into the decoder RNN to produce a new output for that time step
  6. Steps 2 to 5 repeat for each decoder time step until the output exceeds the specified maximum length or an end-of-sequence token is generated (a minimal sketch of steps 2 to 4 follows this list).
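As referenced in step 6, here is a minimal NumPy sketch of steps 2 to 4, assuming the additive scoring form v^T tanh(W_enc h_i + W_dec s_prev); the weight matrices are random placeholders standing in for parameters that would be learned during training.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

hidden = 16
W_enc = np.random.randn(hidden, hidden)        # applied to each encoder state
W_dec = np.random.randn(hidden, hidden)        # applied to the decoder state
v = np.random.randn(hidden)                    # projects to a scalar score

encoder_states = np.random.randn(10, hidden)   # step 1: encoder hidden states
prev_decoder_state = np.random.randn(hidden)   # previous decoder hidden state

# Step 2: additive alignment score for each encoder hidden state.
scores = np.tanh(encoder_states @ W_enc.T + prev_decoder_state @ W_dec.T) @ v

# Step 3: softmax the combined scores into attention weights.
weights = softmax(scores)

# Step 4: context vector = weighted sum of the encoder hidden states.
context = weights @ encoder_states
```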

Luong Attention

This type is also called Multiplicative Attention and was built on top of the Bahdanau Attention. It was proposed by Thang Luong. The main differences between the two lie in how the alignment scores are calculated and at which stage of the decoder the Attention mechanism is introduced.
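For illustration, here is a NumPy sketch of the three score functions described in Luong's paper (dot, general, and concat); the weight matrices and names are placeholders rather than any reference implementation.

```python
import numpy as np

hidden = 16
decoder_state = np.random.randn(hidden)          # current decoder state h_t
encoder_states = np.random.randn(10, hidden)     # encoder hidden states h_s

W_a = np.random.randn(hidden, hidden)            # learned matrix for "general"
W_c = np.random.randn(hidden, 2 * hidden)        # learned matrix for "concat"
v_a = np.random.randn(hidden)                    # learned vector for "concat"

# dot:      score(h_t, h_s) = h_t . h_s
scores_dot = encoder_states @ decoder_state

# general:  score(h_t, h_s) = h_t . (W_a h_s)
scores_general = (encoder_states @ W_a.T) @ decoder_state

# concat:   score(h_t, h_s) = v_a . tanh(W_c [h_t; h_s])
pairs = np.concatenate(
    [np.tile(decoder_state, (len(encoder_states), 1)), encoder_states], axis=1
)
scores_concat = np.tanh(pairs @ W_c.T) @ v_a
```

The dot form requires matching encoder and decoder hidden sizes and has no extra parameters, while general and concat introduce learned weights.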

Here’s how the attention mechanism was implemented in Luong’s paper:

  1. The encoder creates hidden states for each element of the input sequence
  2. The decoder RNN produces a new hidden state for the current time step from the previous decoder output and the previous decoder hidden state
  3. Alignment scores are calculated using the encoder hidden states and the newly created decoder hidden state
  4. The alignment scores for all encoder hidden states are combined into a single vector, which is then passed through a softmax
  5. A context vector is computed as the weighted sum of the encoder hidden states, using the softmaxed alignment scores as weights
  6. The context vector is concatenated with the decoder hidden state created in step 2 to produce the new output
  7. Steps 2 to 6 repeat for each decoder time step until the output exceeds the specified maximum length or an end-of-sequence token is generated (a compact sketch of one decoding step follows this list).
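As referenced in step 7, here is a compact sketch of one decoding step covering steps 2 to 6, with a placeholder standing in for the decoder RNN cell and dot-product scoring assumed; the final tanh over the concatenated vector follows the attentional hidden state form used in Luong's paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

hidden, vocab = 16, 1000
W_c = np.random.randn(hidden, 2 * hidden)    # combines context and decoder state
W_out = np.random.randn(vocab, hidden)       # projects to vocabulary scores

def decoder_rnn_step(prev_output, prev_state):
    # Placeholder for the real decoder RNN cell (step 2).
    return np.tanh(prev_output + prev_state)

def luong_step(prev_output, prev_state, encoder_states):
    decoder_state = decoder_rnn_step(prev_output, prev_state)   # step 2
    scores = encoder_states @ decoder_state                     # step 3 (dot score)
    weights = softmax(scores)                                   # step 4
    context = weights @ encoder_states                          # step 5
    combined = np.concatenate([context, decoder_state])         # step 6
    attentional = np.tanh(W_c @ combined)                       # attentional hidden state
    logits = W_out @ attentional                                # scores for the next token
    return logits, decoder_state
```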
