The introduction of the Attention mechanism has revolutionized the way we work with deep learning models. It is one of the most valuable developments behind many recent advances in Natural Language Processing, such as the Transformer model and Google’s BERT. In this blog, we will explore the concepts behind Attention, its types, and its applications in Transformers.
What is Attention?
Attention generally refers to the process of selectively focusing on a specific thing or topic while ignoring everything else. The Attention mechanism in deep learning is based on the same idea: it selectively focuses on certain factors while processing data and ignores the rest. It is the component of a network’s architecture that measures and manages the interdependence between the input and output elements, as well as among the input elements themselves.
Why is Attention better than the standard sequence-to-sequence model?
The drawback of seq2seq models is their inability to process long input sequences accurately, because they pass only the last hidden state of the encoder RNN to the decoder as the context vector. The Attention mechanism was introduced to overcome this problem: during decoding, it retains all the hidden states of the encoder RNN and relates each decoder output to every hidden state of the input sequence.
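To make this concrete, here is a minimal NumPy sketch of how a single decoder step can combine all encoder hidden states into a context vector through attention weights. The dot-product scoring, the shapes, and the variable names are illustrative assumptions, not the exact formulation of any particular paper.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute a context vector as an attention-weighted sum of encoder states.

    decoder_state:  (hidden_dim,)         current decoder hidden state
    encoder_states: (src_len, hidden_dim) all encoder hidden states
    """
    # One alignment score per source position
    # (dot-product scoring is just one possible choice).
    scores = encoder_states @ decoder_state            # (src_len,)

    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # (src_len,)

    # Context vector: a weighted sum over ALL encoder hidden states,
    # instead of using only the last one.
    context = weights @ encoder_states                 # (hidden_dim,)
    return context, weights

# Toy example: 5 source positions, hidden size 8
encoder_states = np.random.randn(5, 8)
decoder_state = np.random.randn(8)
context, weights = attention_context(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

The key point is that the context vector changes at every decoding step, so the decoder can attend to different parts of the input sequence as it generates each output token.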
Types of Attention Models
Attention models can be categorized into two major types: Bahdanau Attention and Luong Attention. The major differences between them lie in their computations and architecture, while the underlying principles remain the same.
Bahdanau Attention
This model is also called Additive Attention and was proposed by Dzmitry Bahdanau in a paper aimed at improving the seq2seq model for machine translation tasks. It aligns the decoder with the relevant parts of the input sentence before applying the Attention mechanism.
Here’s how the attention mechanism was implemented in Bahdanau’s paper:
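In broad strokes, the decoder’s previous hidden state is scored against every encoder hidden state with a small feed-forward network, the scores are normalised with a softmax, and the resulting weights produce the context vector for the next decoding step. The sketch below illustrates this additive scoring in NumPy; the weight matrices and names (W_a, U_a, v_a) follow the common textbook presentation and are illustrative assumptions, not the authors’ actual code.

```python
import numpy as np

def bahdanau_attention(prev_decoder_state, encoder_states, W_a, U_a, v_a):
    """Additive (Bahdanau-style) attention.

    score(s_{t-1}, h_i) = v_a^T * tanh(W_a s_{t-1} + U_a h_i)
    """
    # Project the previous decoder state and every encoder state,
    # then combine them through a tanh non-linearity.
    projected = np.tanh(prev_decoder_state @ W_a + encoder_states @ U_a)  # (src_len, attn_dim)

    # One scalar alignment score per source position.
    scores = projected @ v_a                                              # (src_len,)

    # Softmax -> attention weights, then weighted sum -> context vector.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ encoder_states                                    # (hidden_dim,)
    return context, weights

# Toy shapes: hidden size 8, attention size 6, 5 source positions
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 6))
U_a = rng.normal(size=(8, 6))
v_a = rng.normal(size=(6,))
context, weights = bahdanau_attention(rng.normal(size=8),
                                      rng.normal(size=(5, 8)), W_a, U_a, v_a)
```

Note that the scoring uses the decoder’s previous hidden state, so the context vector is computed before the decoder produces its current state.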
Luong Attention
This type is also called Multiplicative Attention and was proposed by Thang Luong, building on Bahdanau Attention. The main differences between the two lie in the way the alignment scores are calculated and the stage at which the Attention mechanism is introduced in the decoder.
Here’s how the attention mechanism was implemented in Luong’s paper:
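At a high level, Luong attention scores the decoder’s current hidden state (rather than the previous one) against the encoder states, typically with a plain dot product or a general (bilinear) form, and only afterwards combines the context vector with that state to produce the output. The sketch below shows these two multiplicative scoring variants; the variable names and the W_a matrix in the general form are illustrative assumptions.

```python
import numpy as np

def luong_attention(decoder_state, encoder_states, W_a=None):
    """Multiplicative (Luong-style) attention.

    'dot' scoring:     score(h_t, h_s) = h_t . h_s
    'general' scoring: score(h_t, h_s) = h_t^T W_a h_s   (when W_a is given)
    """
    if W_a is not None:
        scores = encoder_states @ (W_a @ decoder_state)   # general / bilinear form
    else:
        scores = encoder_states @ decoder_state           # plain dot product

    # Softmax over source positions -> attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # The context vector is combined with the CURRENT decoder state afterwards
    # (e.g. concatenated and passed through a tanh layer) to produce the output.
    context = weights @ encoder_states
    return context, weights

# Toy example with the 'general' scoring variant
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)
W_a = rng.normal(size=(8, 8))
context, weights = luong_attention(decoder_state, encoder_states, W_a)
```

Compared with the additive sketch above, the multiplicative forms avoid the extra feed-forward scoring network and are cheaper to compute when the hidden dimensions match.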