Transformers – A Deep Learning Model for NLP


The Transformer is a deep learning model introduced in 2017 that is used primarily for natural language processing (NLP) tasks. It is designed to handle sequential data, carrying out tasks such as text summarization and machine translation.

Let’s take a deep dive into its architecture and why it is considered better than recurrent neural networks (RNNs).

Encoder & Decoder Architecture

Transformers have an encoder-decoder architecture. The encoder consists of two main components: a self-attention mechanism and a feed-forward neural network. The decoder consists of three main components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network.

Both the encoder and decoder are modular: their modules can be stacked on top of each other multiple times. Each encoder module processes the input to generate encodings, which are then passed as inputs to the next encoder module. The encodings capture which parts of the input are relevant to each other.

The decoder modules, on the other hand, process the encodings and generate an output sequence using the contextual information incorporated within them. Each encoder and decoder layer uses an attention mechanism to weigh the relevance of every input and extract information from it accordingly. Each decoder layer also carries an additional attention mechanism that draws information from the outputs of previous decoder layers; this takes place before the decoder draws information from the encodings. Finally, both encoder and decoder layers include a feed-forward neural network for additional processing of the output.
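The self-attention step described above can be sketched in a few lines of numpy. This is a minimal illustration of scaled dot-product self-attention, not the full multi-head mechanism; the weight matrices and dimensions here are arbitrary placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: rows become attention weights summing to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every token to every other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 4 tokens, model dimension 8 (illustrative values only).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` tells you how much that token attends to every other token in the sequence, which is exactly the "weighing the relevance of every input" described above.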

Why Are Transformers Preferred Over RNNs?

Until recently, most natural language processing systems relied on gated recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks and gated recurrent units (GRUs), augmented with attention mechanisms. Since the introduction of Transformers, however, these older recurrent architectures have steadily been replaced.

Although both RNNs and Transformers can handle sequential data, Transformers, unlike RNNs, do not need to process that data in order. When a Transformer model processes a natural language sentence, it does not have to work through it from the beginning. Transformers therefore allow much more parallelization than RNNs and, as a result, require less training time.

Transformers are built entirely on attention mechanisms, without any recurrent structure. This demonstrates that attention alone, without recurrent sequential processing, can match the performance of RNNs.

Since Transformers facilitate more parallelization than older RNNs, they can be trained on much larger datasets, which has made pre-trained systems such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) possible. These systems were trained on large general-language datasets and, as a result, can be fine-tuned to perform specific language tasks.

Trust Data Labeler with All Your Human Data Annotation Needs

Data Labeler specializes in building comprehensive datasets that are perfect for training your ML models. Data annotation is a very significant part of any AI/ML undertaking, but you don’t have to spend time annotating data yourself. We will do the heavy lifting while you focus on optimizing your AI/ML models to perfection. Write to us at for customized training datasets for your AI/ML projects.

