Top 10 Pre-Trained NLP Language Models

Aug 12, 2020 5:19:47 PM

What is NLP? Natural Language Processing (NLP) is a pre-eminent AI technology that enables machines to read, decipher, understand, and make sense of human languages. From text prediction and sentiment analysis to speech recognition, NLP allows machines to emulate human language abilities impressively.

Language models are the key to building NLP applications. However, building complex NLP language models from scratch is a tedious task. That is why AI developers and researchers swear by pre-trained language models. These models rely on transfer learning: a model is first trained on one dataset to perform a task, and the same model is then repurposed to perform different NLP functions on a new dataset.

A pre-trained model already solves a specific problem and requires only fine-tuning, which saves much of the time and computational resources otherwise needed to build a new language model. There are several pre-trained NLP models available, categorized by the purpose they serve. Let's take a look at the top 10 pre-trained NLP models.

1. BERT (Bidirectional Encoder Representations from Transformers) 

BERT is a technique for NLP pre-training, developed by Google. It utilizes the Transformer, a neural network architecture based on a self-attention mechanism for language understanding. The Transformer was originally developed to address the problem of sequence transduction, such as neural machine translation. That means it is well suited to any task that transforms an input sequence into an output sequence, such as speech recognition or text-to-speech transformation.

In its vanilla form, the transformer includes two separate mechanisms: an encoder (which reads the text input) and a decoder (which produces a prediction for the task). The goal of the BERT mechanism is to generate a language model. Thus, only the encoder mechanism is necessary. 
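To make the self-attention idea concrete, here is a minimal sketch in plain Python. It assumes queries, keys, and values are all the raw input vectors; real Transformers apply learned projection matrices and multiple attention heads, which are omitted here. The function names and toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption: the learned projections of a real Transformer are left out.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up tokens
attended = self_attention(toy_embeddings)
```

Every output vector mixes information from all positions in the sequence, which is what lets an encoder read context on both sides of a word.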

BERT has been shown to perform well on 11 NLP tasks. It’s trained on 2,500 million words from Wikipedia and 800 million words from the BookCorpus dataset. Google Search is one of the best examples of BERT’s efficiency. Other Google applications, such as Google Docs and Gmail Smart Compose, utilize BERT for text prediction.

2. RoBERTa (Robustly Optimized BERT Pretraining Approach)

RoBERTa is an optimized method for the pre-training of a self-supervised NLP system. It builds the language model on BERT’s language masking strategy that enables the system to learn and predict intentionally hidden sections of text.
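The language masking strategy can be sketched as follows. This is a simplified illustration, not the exact procedure used by BERT or RoBERTa: it masks each token independently with a fixed probability, and the names `mask_tokens`, `mask_prob`, and `seed` are assumptions introduced for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember the originals.

    Returns the corrupted sequence plus a {position: original_token} dict
    that a model would be trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[position] = token  # the label the model must recover
        else:
            corrupted.append(token)
    return corrupted, targets

sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.5, seed=1)
```

Because the labels come from the text itself, no human annotation is needed, which is why this kind of training is called self-supervised.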

RoBERTa modifies key hyperparameters and training choices in BERT, such as training with larger mini-batches and removing BERT’s next-sentence pretraining objective. Pre-trained models like RoBERTa are known to outperform BERT on all individual tasks of the General Language Understanding Evaluation (GLUE) benchmark and can be used for NLP tasks such as question answering, dialogue systems, and document classification.

3. OpenAI’s GPT-3

GPT-3 is a transformer-based NLP model that performs translation, question answering, poetry composition, and cloze tasks, along with tasks that require on-the-fly reasoning, such as unscrambling words. Moreover, GPT-3 has been used to write news articles and generate code.

GPT-3 can manage statistical dependencies between different words. It has 175 billion parameters and is trained on text drawn from about 45 TB of data sourced from all over the internet. This makes it one of the biggest pre-trained NLP models available.

What differentiates GPT-3 from other language models is that it does not require fine-tuning to perform downstream tasks. With its ‘text in, text out’ API, developers can reprogram the model using instructions.
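As a rough illustration of reprogramming a model through text alone, the sketch below assembles a prompt from an instruction and a few worked examples. The `Input:`/`Output:` layout and the `build_prompt` helper are assumptions made for this example, not a format prescribed by any API, and no API call is made.

```python
def build_prompt(instruction, examples, query):
    """Assemble a plain-text prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt(
    "Unscramble the letters to form an English word.",
    [("pplea", "apple"), ("esuoh", "house")],
    "rgtie",
)
```

The task is defined entirely by the text sent to the model, so switching tasks means changing the prompt rather than retraining any weights.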


4. ALBERT (A Lite BERT)

The increasing size of pre-trained language models helps in improving the performance of downstream tasks. However, as the model size increases, it leads to issues such as longer training times and GPU/TPU memory limitations. To address this problem, Google presented ALBERT, a lite version of BERT (Bidirectional Encoder Representations from Transformers). This model was introduced with two parameter-reduction techniques:

  • Factorized Embedding Parameterization: Here, the size of the hidden layers is separated from the size of the vocabulary embeddings.

  • Cross-Layer Parameter Sharing: This prevents the number of parameters from growing with the depth of the network. 

These parameter-reduction techniques lower memory consumption and increase the training speed of the model. Moreover, ALBERT introduces a self-supervised loss for sentence-order prediction, which addresses a limitation of BERT with regard to inter-sentence coherence.
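The effect of the two parameter-reduction techniques can be checked with simple arithmetic. The sizes used here (vocabulary of 30,000, hidden size 768, embedding size 128, 7 million parameters per layer) are illustrative assumptions, and the helper functions are hypothetical names for this sketch.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without factorization: one vocab_size x hidden_size matrix.
    With factorization: a vocab_size x embedding_size matrix followed by
    an embedding_size x hidden_size projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(params_per_layer, num_layers, share_across_layers):
    """With cross-layer sharing, one set of layer weights is reused at every depth."""
    return params_per_layer if share_across_layers else params_per_layer * num_layers

# Illustrative sizes (assumptions, not figures from the article).
V, H, E = 30_000, 768, 128
tied = embedding_params(V, H)            # 23,040,000
factorized = embedding_params(V, H, E)   # 3,938,304

unshared = encoder_params(7_000_000, 12, share_across_layers=False)  # 84,000,000
shared = encoder_params(7_000_000, 12, share_across_layers=True)     # 7,000,000
```

Under these assumed sizes, factorization cuts the embedding block to roughly a sixth of its original size, and sharing keeps the encoder's parameter count constant as layers are added.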

5. XLNet

Denoising-autoencoding language models such as BERT achieve better performance than models pre-trained with autoregressive language modeling, but BERT relies on corrupting the input with masks, which brings limitations of its own. XLNet introduces a generalized autoregressive pre-training method that offers two benefits: it enables learning bidirectional context, and its autoregressive formulation overcomes the limitations of BERT. XLNet is known to outperform BERT on 20 tasks, including natural language inference, document ranking, sentiment analysis, and question answering.
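XLNet obtains bidirectional context from an autoregressive model by training over permutations of the factorization order. The sketch below is a simplified illustration of that idea only: it samples one prediction order and lists which positions each token may condition on. The function name and the example sentence are assumptions for this sketch.

```python
import random

def permutation_contexts(tokens, seed=0):
    """For one sampled factorization order, list what each token may condition on.

    Tokens stay in their original positions; only the prediction order changes.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    contexts = {}
    for step, position in enumerate(order):
        # Positions predicted earlier in this order are visible as context.
        contexts[position] = sorted(order[:step])
    return order, contexts

tokens = ["New", "York", "is", "a", "city"]
order, contexts = permutation_contexts(tokens, seed=3)
```

Across many sampled orders, a token ends up conditioning on words to both its left and its right, while each individual prediction remains autoregressive.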

6. OpenAI’s GPT-2

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. OpenAI’s GPT-2 demonstrates that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of web pages called WebText. The model generates coherent paragraphs of text and achieves promising, competitive, or state-of-the-art results on a wide variety of tasks.

7. StructBERT

The pre-trained language model BERT (and its robustly optimized version, RoBERTa) has attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art accuracy in various NLU tasks, such as sentiment classification, natural language inference, semantic textual similarity, and question answering. Inspired by the linearization exploration work of Elman, experts have extended BERT to a new model, StructBERT, by incorporating language structures into pre-training. StructBERT with structural pre-training gives surprisingly good empirical results on a variety of downstream tasks, including pushing the state of the art on the GLUE benchmark to 89.0 (outperforming all published models), the F1 score on SQuAD v1.1 question answering to 93.0, and the accuracy on SNLI to 91.7. Like other pre-trained language models, StructBERT may assist businesses with a variety of NLP tasks, including question answering, sentiment analysis, and document summarization.

8. T5 (Text-to-Text Transfer Transformer)

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practices. The Google research team suggests a unified approach to transfer learning in NLP to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on a large corpus of web-scraped data to get state-of-the-art results on several NLP tasks.
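The “text-to-text” framing can be illustrated by rendering different tasks as plain input and output strings. This is a sketch under assumptions: the `to_text_to_text` helper is hypothetical, and the exact prefix wording is chosen for illustration rather than copied from the released model.

```python
def to_text_to_text(task, **fields):
    """Render a task instance as a single input string with a task prefix."""
    if task == "translate":
        return (f"translate {fields['source_lang']} to "
                f"{fields['target_lang']}: {fields['text']}")
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "sentiment":
        return f"sentiment: {fields['text']}"
    raise ValueError(f"unknown task: {task}")

# Each training pair is (input text, target text), whatever the task.
examples = [
    (to_text_to_text("translate", source_lang="English", target_lang="German",
                     text="That is good."), "Das ist gut."),
    (to_text_to_text("sentiment", text="I loved this film."), "positive"),
]
```

Since both a translation and a class label are expressed as target strings, a single model with a single training objective can cover all of these tasks.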

9. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, the ELECTRA authors propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, they train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.

This approach enables the model to learn from all input tokens instead of only the small masked-out subset. Despite its similarity to a GAN, it is not adversarial, because the generator that produces replacement tokens is trained with maximum likelihood. ELECTRA stands out for its computational efficiency.
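The training signal for ELECTRA's replaced token detection can be sketched as a per-token binary label. In this minimal illustration the "generator sample" is written by hand rather than produced by a network, and the function name is an assumption for the example.

```python
def replaced_token_labels(original, corrupted):
    """Return 1 where the corrupted token differs from the original, else 0.

    These are the targets a discriminator would be trained to predict.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
# "ate" stands in for a plausible token sampled from a generator.
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = replaced_token_labels(original, corrupted)  # [0, 0, 1, 0, 0]
```

Every position carries a label, so the discriminator receives a learning signal from the whole sequence rather than only from the few positions that were changed.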

10. DeBERTa (Decoding-enhanced BERT with disentangled attention)

The authors from Microsoft Research propose DeBERTa, with two main improvements over BERT, namely disentangled attention and an enhanced mask decoder. DeBERTa represents each token/word with two vectors that encode its content and its relative position, respectively. The self-attention mechanism in DeBERTa computes content-to-content, content-to-position, and position-to-content attention, while the self-attention in BERT is equivalent to having only the first two components. The authors hypothesize that position-to-content self-attention is also needed to comprehensively model relative positions in a sequence of tokens. Furthermore, DeBERTa is equipped with an enhanced mask decoder, where the absolute position of the token/word is also given to the decoder along with the relative information. A single scaled-up variant of DeBERTa surpasses the human baseline on the SuperGLUE benchmark for the first time. The ensemble DeBERTa is the top-performing method on SuperGLUE at the time of this publication.
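The three attention components can be sketched as a sum of dot products. This is a heavily simplified illustration: the learned query/key projections and the scaling factor are omitted, and `disentangled_score`, `rel_pos`, and `max_dist` are names assumed for this example rather than identifiers from the DeBERTa code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def disentangled_score(content, rel_pos, i, j, max_dist):
    """Unnormalized attention score from token i to token j as a sum of three terms.

    content[k] : content vector of token k
    rel_pos[d] : vector for the clipped relative distance (d - max_dist)
    Assumption: learned projections and scaling are left out.
    """
    def bucket(delta):
        # Clip the relative distance and shift it to a valid list index.
        return max(-max_dist, min(max_dist, delta)) + max_dist

    content_to_content = dot(content[i], content[j])
    content_to_position = dot(content[i], rel_pos[bucket(i - j)])
    position_to_content = dot(rel_pos[bucket(j - i)], content[j])
    return content_to_content + content_to_position + position_to_content

content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up content vectors
rel_pos = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.5]]   # distances -1, 0, +1
score = disentangled_score(content, rel_pos, i=0, j=2, max_dist=1)  # 1.0 + 0.5 + 0.5 = 2.0
```

Keeping content and position in separate vectors lets each term be inspected on its own, which is the sense in which the attention is "disentangled".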

Building an AI Application with Pre-Trained NLP Models

The importance and advantages of pre-trained language models are quite clear. Thankfully, developers have access to these models, which help them achieve precise output while saving resources and AI application development time.

But, which NLP language model works best for your AI project? Well, the answer to that depends upon the scale of the project, type of dataset, training methodologies, and several other factors. To understand which NLP language model will help your project to achieve maximum accuracy and reduce its time to market, you can connect with our AI experts. 

For that, you can set up a free consultation session with them, in which they will guide you toward the right approach to developing your AI-based application.

Written by Daffodil Software

Daffodil is a software services company that is a proud technology partner to 80+ dynamic organizations across the globe. Our specialty lies in our ability to look beyond technologies and deliver innovative and progressive solutions. We experiment with the latest technologies, design approaches, and development methodologies to build cutting-edge software solutions.