How does BERT work?

BERT is designed to build a language model, so only the encoder mechanism of the Transformer is used. A sequence of tokens is fed to the Transformer encoder. These tokens are first embedded into vectors and then processed through the network. The output is a sequence of vectors, each corresponding to an input token, providing contextualized representations.
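As a quick illustration, the snippet below is a minimal sketch of this flow, assuming the Hugging Face transformers library and the pre-trained bert-base-uncased checkpoint (neither of which is prescribed above): a sentence goes in, and one contextual vector per input token comes out.

```python
# Minimal sketch: tokens in, one contextual vector per token out.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT produces contextual embeddings.", return_tensors="pt")
outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size) -- one 768-dimensional
# vector for every input token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)
```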

When training language models, defining a prediction goal is a challenge. Many models predict the next word in a sequence, which is a directional approach and may limit context learning. BERT addresses this challenge with two innovative training strategies:

  1. Masked Language Model (MLM)
  2. Next Sentence Prediction (NSP)

1. Masked Language Model (MLM)

In BERT’s pre-training process, a portion of words in each input sequence is masked and the model is trained to predict the original values of these masked words based on the context provided by the surrounding words.

In simple terms,

  1. Masking words: Before BERT learns from sentences, it hides some words (about 15%) and replaces them with a special symbol, like [MASK] (a toy sketch of this step follows the list).
  2. Guessing Hidden Words: BERT’s job is to figure out what these hidden words are by looking at the words around them. It’s like a game of guessing where some words are missing, and BERT tries to fill in the blanks.
  3. How BERT learns:
    • BERT adds a special layer on top of its learning system to make these guesses. It then checks how close its guesses are to the actual hidden words.
    • It does this by converting its guesses into probabilities, saying, “I think this word is X, and I’m this much sure about it.”
  4. Special Attention to Hidden Words
    • BERT’s main focus during training is on getting these hidden words right. It cares less about predicting the words that are not hidden.
    • This is because the real challenge is figuring out the missing parts, and this strategy helps BERT become really good at understanding the meaning and context of words.
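The toy snippet below illustrates only the masking idea from the list above. It is a simplified sketch in plain Python, not BERT's actual pre-processing code; the token list and the 15% figure are used purely for the demo.

```python
# Toy illustration of the masking step: hide about 15% of the words and
# remember the originals so the model can be scored on its guesses.
import random

tokens = "the quick brown fox jumps over the lazy dog".split()
num_to_mask = max(1, round(0.15 * len(tokens)))
masked_positions = random.sample(range(len(tokens)), num_to_mask)

targets = {}                       # position -> original word to be guessed
for pos in masked_positions:
    targets[pos] = tokens[pos]
    tokens[pos] = "[MASK]"

print(tokens)    # e.g. ['the', 'quick', '[MASK]', 'fox', ...]
print(targets)   # e.g. {2: 'brown'}
```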

In technical terms (a code sketch follows this list),

  1. BERT adds a classification layer on top of the output from the encoder. This layer is crucial for predicting the masked words.
  2. The output vectors from the classification layer are multiplied by the embedding matrix, transforming them into the vocabulary dimension. This step helps align the predicted representations with the vocabulary space.
  3. The probability of each word in the vocabulary is calculated using the SoftMax activation function. This step generates a probability distribution over the entire vocabulary for each masked position.
  4. The loss function used during training considers only the prediction of the masked values. The model is penalized for the deviation between its predictions and the actual values of the masked words.
  5. The model converges more slowly than directional models. This is because, during training, BERT is only concerned with predicting the masked values, ignoring the prediction of the non-masked words. The increased context awareness achieved through this strategy compensates for the slower convergence.
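A minimal PyTorch sketch of steps 1–4 above is shown below. It is an illustration under simplifying assumptions, not BERT's actual implementation: random tensors stand in for the encoder output, the classification layer is reduced to a weight-tied projection onto the embedding matrix, and the masked position and vocabulary id are hypothetical.

```python
# Sketch of the MLM head: project encoder outputs into the vocabulary
# dimension via the embedding matrix, take a softmax, and compute the loss
# only at masked positions. Random tensors stand in for the encoder output.
import torch
import torch.nn.functional as F

vocab_size, hidden_size, seq_len = 30522, 768, 10
embedding = torch.nn.Embedding(vocab_size, hidden_size)

# Stand-in for the encoder output: one hidden vector per input token.
encoder_output = torch.randn(1, seq_len, hidden_size)

# Steps 1-2: classification layer tied to the embedding matrix, mapping each
# hidden vector into the vocabulary dimension.
logits = encoder_output @ embedding.weight.T          # (1, seq_len, vocab_size)

# Step 3: softmax turns the logits into a probability distribution over the
# whole vocabulary at every position.
probs = F.softmax(logits, dim=-1)

# Step 4: labels carry the original token id only at masked positions and
# -100 elsewhere, so cross-entropy ignores all non-masked tokens.
# (cross_entropy applies the softmax internally, so it takes the raw logits.)
labels = torch.full((1, seq_len), -100)
labels[0, 3] = 2003            # hypothetical vocabulary id of the hidden word
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)
print(loss)
```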

2. Next Sentence Prediction (NSP)

In Next Sentence Prediction, BERT is given pairs of sentences and learns to predict whether the second sentence is connected to the first in the original document (a code sketch follows the numbered list below).

  1. In the training process, BERT learns to understand the relationship between pairs of sentences, predicting if the second sentence follows the first in the original document.
  2. 50% of the input pairs have the second sentence as the subsequent sentence in the original document, and the other 50% have a randomly chosen sentence.
  3. To help the model distinguish between connected and disconnected sentence pairs, the input is processed before entering the model:
    • A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is added at the end of each sentence.
    • A sentence embedding indicating Sentence A or Sentence B is added to each token.
    • A positional embedding indicates the position of each token in the sequence.
  4. BERT predicts if the second sentence is connected to the first. This is done by transforming the output of the [CLS] token into a 2×1 shaped vector using a classification layer, and then calculating the probability of whether the second sentence follows the first using SoftMax.
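The snippet below sketches the NSP setup described in the list, assuming the Hugging Face transformers library and its BertForNextSentencePrediction head (an assumption; the article does not prescribe a particular toolkit). The tokenizer adds the [CLS]/[SEP] tokens and the sentence A/B ids, and the model classifies the [CLS] output into two classes; the example sentences are made up.

```python
# Sketch of NSP with Hugging Face transformers (an assumed toolkit).
# The tokenizer inserts [CLS]/[SEP] and sentence A/B ids; the model turns the
# [CLS] output into two logits, and softmax gives the "is next" probability.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

enc = tokenizer("She opened the door.", "The room was dark.",
                return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens

logits = model(**enc).logits               # shape (1, 2)
probs = torch.softmax(logits, dim=-1)
print(probs)   # index 0: "second sentence follows", index 1: "it does not"
```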

During the training of the BERT model, Masked LM and Next Sentence Prediction are trained together. The model aims to minimize the combined loss function of the Masked LM and Next Sentence Prediction, leading to a robust language model with enhanced capabilities in understanding context within sentences and relationships between sentences.
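As a hedged illustration of this joint objective, the sketch below uses BertForPreTraining from the Hugging Face transformers library (an assumption, not the article's own code), which returns a single loss summing the MLM and NSP terms. The example sentences, the choice of the token "mat" to mask, and the position lookup are all hypothetical details for the demo.

```python
# Sketch of the joint objective: one scalar loss = MLM loss + NSP loss.
# Assumes the Hugging Face `transformers` library; not BERT's original code.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# A pair where sentence B really follows sentence A, so the NSP label is 0.
enc = tokenizer("The cat sat on the mat.", "It fell asleep.",
                return_tensors="pt")

# Build MLM labels: -100 everywhere except one token we mask by hand.
input_ids = enc["input_ids"]
labels = torch.full_like(input_ids, -100)
mask_pos = (input_ids[0] == tokenizer.convert_tokens_to_ids("mat")).nonzero()[0, 0]
labels[0, mask_pos] = input_ids[0, mask_pos]       # original id is the target
input_ids[0, mask_pos] = tokenizer.mask_token_id   # replace it with [MASK]

out = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
print(out.loss)   # combined MLM + NSP loss that pre-training minimizes
```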

Why train Masked LM and Next Sentence Prediction together?

Masked LM helps BERT understand the context within a sentence, and Next Sentence Prediction helps BERT grasp the connection or relationship between pairs of sentences. Hence, training both strategies together ensures that BERT learns a broad and comprehensive understanding of language, capturing both details within sentences and the flow between sentences.
