Here we decode the token IDs back to text with tokenizer.decode, split that text into tokens with tokenizer.tokenize, and finally encode the text into token IDs with tokenizer.encode.


# Decode the token IDs back to text, dropping special tokens such as [CLS] and [SEP]
decoded_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print(f"Decoded Text: {decoded_text}")
# Tokenize the decoded text again for reference
tokenized_text = tokenizer.tokenize(decoded_text)
print(f"tokenized Text: {tokenized_text}")
# Encode the original text; return_tensors='pt' returns a PyTorch tensor
encoded_text = tokenizer.encode(text, return_tensors='pt')
print(f"Encoded Text: {encoded_text}")


Decoded Text: w3wiki is a computer science portal
tokenized Text: ['geek', '##sf', '##org', '##ee', '##ks', 'is', 'a', 'computer', 'science', 'portal']
Encoded Text: tensor([[  101, 29294, 22747, 21759,  4402,  5705,  2003,  1037,  3274,  2671,
          9445,   102]])

The decoded text is the same as the input text except that it is all lower case, because we used the bert-base-uncased model. The encoded text and the input IDs are also the same, since tokenizer.encode and tokenizer.batch_encode_plus produce the same sequence of token IDs for a given input text. As discussed previously, BERT can handle out-of-vocabulary words (words absent from its pre-trained vocabulary), which here is ‘w3wiki’, so it is broken down into sub-word tokens.
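To make the sub-word splitting concrete, here is a minimal sketch of the greedy longest-match-first strategy that WordPiece tokenization uses. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real tokenizer uses a vocabulary of roughly 30,000 entries and applies extra normalization first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so the split mirrors the token list printed above
toy_vocab = {"geek", "##sf", "##org", "##ee", "##ks", "portal"}

print(wordpiece_tokenize("geeksforgeeks", toy_vocab))
print(wordpiece_tokenize("portal", toy_vocab))
```

With this toy vocabulary the first call yields the same five pieces shown in the tokenized output, while a word that is in the vocabulary comes back as a single token.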

Extract and print Word Embeddings

Finally, we extract the generated word embeddings and print them. The embeddings are contextual, so each one captures the meaning of its token within the sentence. We do not print the tokens here because they are not needed; to print them as well, uncomment the token-printing line in the for loop.


# Print word embeddings for each token
for token, embedding in zip(tokenized_text, word_embeddings[0]):
    #print(f"Token: {token}")
    print(f"Embedding: {embedding}")


Embedding: tensor([-2.4299e-01, -2.2849e-01,  5.8441e-02,  5.7861e-03, -4.3398e-01,
        -3.4387e-01,  9.6974e-02,  3.6446e-01, -6.3829e-02, -2.3413e-01,
        -3.2477e-01, -4.9730e-01, -3.0048e-01,  3.5098e-01, -4.8904e-01,
        -1.2836e-01, -5.5042e-01,  4.0802e-02, -3.2041e-01, -1.6057e-01,
Embedding: tensor([-5.9422e-01,  3.0865e-01, -3.5836e-01, -1.6872e-02,  2.9080e-01,
        -5.5942e-01, -2.2233e-01,  7.7186e-01, -8.0256e-01,  2.2205e-01,
        -6.1288e-01, -6.0329e-01, -8.2418e-02,  2.8664e-01, -1.1168e+00,
         1.1978e+00,  6.1283e-02, -3.9820e-01,  1.1269e-01, -7.9150e-01,

This produces a very large output, so only a small portion is shown here: part of the embeddings of the first and last tokens.
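One detail to watch when uncommenting the token line: tokenized_text holds 10 tokens, while the encoded IDs shown earlier hold 12 because they include [CLS] (ID 101) and [SEP] (ID 102). Assuming word_embeddings was computed from IDs that include those special tokens, zip would pair each token with the embedding one position too early and stop before the last two. The sketch below uses the printed IDs as stand-ins for embedding rows to show the shift; the variable names are illustrative only.

```python
# Token IDs and tokens copied from the outputs printed earlier
token_ids = [101, 29294, 22747, 21759, 4402, 5705, 2003, 1037, 3274, 2671, 9445, 102]
tokens = ['geek', '##sf', '##org', '##ee', '##ks', 'is', 'a', 'computer', 'science', 'portal']

# Naive pairing: zip stops at the shorter list and is shifted by one position,
# so the first token is paired with the [CLS] position
naive_pairs = list(zip(tokens, token_ids))
print(naive_pairs[0])

# Aligned pairing: add the special tokens so both lists have the same length
aligned_tokens = ["[CLS]"] + tokens + ["[SEP]"]
aligned_pairs = list(zip(aligned_tokens, token_ids))
print(aligned_pairs[0], aligned_pairs[1], aligned_pairs[-1])
```

With the real tokenizer, tokenizer.convert_ids_to_tokens(input_ids[0]) returns the token list with the special tokens already in place, which avoids building it by hand.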

Printing Sentence Embedding

We also generate a sentence embedding by averaging the word embeddings (average pooling).


# Compute the sentence embedding by averaging the word embeddings
# along the sequence-length dimension (dim=1)
sentence_embedding = word_embeddings.mean(dim=1)
# Print the sentence embedding
print("Sentence Embedding:")
print(sentence_embedding)
# Print the shape of the sentence embedding
print(f"Shape of Sentence Embedding: {sentence_embedding.shape}")


Sentence Embedding:
tensor([[-1.2731e-01,  2.3766e-01,  1.6280e-01,  1.7505e-01,  2.1393e-01,
         -7.2085e-01, -1.1638e-01,  5.5303e-01, -2.4897e-01, -3.5929e-02,
         -9.9867e-02, -5.9745e-01, -1.2873e-02,  4.0385e-01, -4.7625e-01,
          9.3286e-02, -3.1485e-01,  1.4257e-02, -3.1248e-01, -1.5662e-01,
         -1.8107e-01, -2.4591e-01, -9.8347e-02,  5.4759e-01,  1.2483e-01,
         -1.1171e-01,  2.2538e-01,  5.8986e-02]])
Shape of Sentence Embedding: torch.Size([1, 768])

This also produces a large output, followed by the shape of the sentence embedding, which is [number of sentences, hidden size].
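A plain mean over dim=1 works for a single sentence, but when several sentences are padded to a common length the padding positions would be averaged in too. A common variant, sketched here under that assumption, weights each position by the attention mask. The helper name `masked_mean_pool` and the toy tensors are hypothetical and are not part of the code above.

```python
import torch


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) positions only."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    # Number of real tokens per sentence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy batch: 2 sentences, 3 positions each, hidden size 2
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position stands in for padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(toy_embeddings, toy_mask)
print(pooled)
print(pooled.shape)
```

The result keeps the [number of sentences, hidden size] shape, and the padded position in the first toy sentence no longer influences its average.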

Word embeddings are an important part of the NLP process. They capture the semantic meaning of words, reduce dimensionality, add contextual information, and promote efficient learning by transferring linguistic knowledge via pre-trained embeddings. As a result, we get better performance with limited task-specific data. This article covers BERT and how it generates embeddings.

