Decode and Encode the text

Here we decode the token IDs back to text using tokenizer.decode, tokenize the decoded text again using tokenizer.tokenize for reference, and finally encode it back to token IDs using tokenizer.encode.

Python3




# Decode the token IDs back to text
decoded_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
# Print the decoded text
print(f"Decoded Text: {decoded_text}")

# Tokenize the decoded text again for reference
tokenized_text = tokenizer.tokenize(decoded_text)
# Print the tokenized text
print(f"tokenized Text: {tokenized_text}")

# Encode the text back to token IDs
encoded_text = tokenizer.encode(text, return_tensors='pt')  # Returns a PyTorch tensor
# Print the encoded text
print(f"Encoded Text: {encoded_text}")


Output:

Decoded Text: w3wiki is a computer science portal
tokenized Text: ['geek', '##sf', '##org', '##ee', '##ks', 'is', 'a', 'computer', 'science', 'portal']
Encoded Text: tensor([[  101, 29294, 22747, 21759,  4402,  5705,  2003,  1037,  3274,  2671,
          9445,   102]])

The decoded text is the same as the input text, except that it is all lower case because we used the bert-base-uncased model variant. The encoded text and the input IDs are identical because tokenizer.encode and tokenizer.batch_encode_plus produce the same sequence of token IDs for a given input text. As discussed previously, BERT can handle out-of-vocabulary words (words not present in its pre-training corpus), such as 'w3wiki' here, by breaking them down into sub-word tokens.
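As a quick sanity check, you can verify that tokenizer.encode and tokenizer.batch_encode_plus return the same token IDs and inspect how an out-of-vocabulary word is split into WordPiece sub-tokens. Below is a minimal sketch, assuming the same tokenizer (bert-base-uncased) and text variables from the earlier steps:

Python3

# Verify that encode and batch_encode_plus agree on the token IDs
ids_from_encode = tokenizer.encode(text)  # plain Python list of token IDs
ids_from_batch = tokenizer.batch_encode_plus([text])['input_ids'][0]
print(ids_from_encode == ids_from_batch)  # should print True

# Inspect how an out-of-vocabulary word is split into sub-word pieces
# (continuation pieces are prefixed with ##)
print(tokenizer.tokenize('w3wiki'))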

Extract and print Word Embeddings

Finally, we extract the generated word embeddings and print them. Word embeddings are contextual, so they capture the meaning of each word as it appears in this particular sentence. We do not print the tokens here as it is not needed; if you wish, you can print them by uncommenting the token-printing line inside the for loop. A quick check of the embedding tensor's shape is shown after the output.

Python3




# Print the word embedding for each token
# (positions 0 and -1 of word_embeddings[0] belong to the special [CLS] and [SEP]
#  tokens, which have no counterpart in tokenized_text, so we skip them)
for token, embedding in zip(tokenized_text, word_embeddings[0][1:-1]):
    # print(f"Token: {token}")
    print(f"Embedding: {embedding}")
    print("\n")


Output:

Embedding: tensor([-2.4299e-01, -2.2849e-01,  5.8441e-02,  5.7861e-03, -4.3398e-01,
        -3.4387e-01,  9.6974e-02,  3.6446e-01, -6.3829e-02, -2.3413e-01,
        -3.2477e-01, -4.9730e-01, -3.0048e-01,  3.5098e-01, -4.8904e-01,
        -1.2836e-01, -5.5042e-01,  4.0802e-02, -3.2041e-01, -1.6057e-01,
          ................................................
......
......
......
......
Embedding: tensor([-5.9422e-01,  3.0865e-01, -3.5836e-01, -1.6872e-02,  2.9080e-01,
        -5.5942e-01, -2.2233e-01,  7.7186e-01, -8.0256e-01,  2.2205e-01,
        -6.1288e-01, -6.0329e-01, -8.2418e-02,  2.8664e-01, -1.1168e+00,
         1.1978e+00,  6.1283e-02, -3.9820e-01,  1.1269e-01, -7.9150e-01,
         ...................................................

This generates a very large output, so only a small portion of the embeddings is reproduced here for illustration. The output above shows parts of the embeddings of the first and last tokens only.
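If you also want to confirm the dimensions involved, a minimal check along the following lines can be used (assuming word_embeddings is the last hidden state tensor obtained from the model in the earlier step):

Python3

# Shape of the full embedding tensor: [batch size, sequence length, hidden size]
print(f"Shape of Word Embeddings: {word_embeddings.shape}")

# Shape of a single token's embedding: [hidden size], i.e. 768 for bert-base-uncased
print(f"Shape of one token embedding: {word_embeddings[0][0].shape}")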

Printing Sentence Embedding

We will also generate a sentence embedding by averaging the word embeddings (average pooling along the sequence length dimension).

Python3




# Compute the average of word embeddings to get the sentence embedding
sentence_embedding = word_embeddings.mean(dim=1)  # Average pooling along the sequence length dimension
 
# Print the sentence embedding
print("Sentence Embedding:")
print(sentence_embedding)
 
# Output the shape of the sentence embedding
print(f"Shape of Sentence Embedding: {sentence_embedding.shape}")


Output:

Sentence Embedding:
tensor([[-1.2731e-01,  2.3766e-01,  1.6280e-01,  1.7505e-01,  2.1393e-01,
         -7.2085e-01, -1.1638e-01,  5.5303e-01, -2.4897e-01, -3.5929e-02,
         -9.9867e-02, -5.9745e-01, -1.2873e-02,  4.0385e-01, -4.7625e-01,
          9.3286e-02, -3.1485e-01,  1.4257e-02, -3.1248e-01, -1.5662e-01,
         -1.8107e-01, -2.4591e-01, -9.8347e-02,  5.4759e-01,  1.2483e-01,
           .......................................
           .......................................
         -1.1171e-01,  2.2538e-01,  5.8986e-02]])
Shape of Sentence Embedding: torch.Size([1, 768])

It will also generate a large output, along with the shape of the sentence embedding, which is [number of sentences, hidden size].
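Note that this simple mean averages over every position, including the special [CLS] and [SEP] tokens and any padding added when several sentences are batched together. A common refinement is to use the attention mask so that only real tokens contribute to the average. Below is a minimal sketch, assuming word_embeddings is the last hidden state from above and attention_mask is the mask returned by the tokenizer as a PyTorch tensor (for example, via batch_encode_plus with return_tensors='pt'; the variable name is only illustrative):

Python3

# Mask-aware mean pooling: average only over real (non-padding) tokens
# attention_mask: [batch, seq_len] with 1 for real tokens and 0 for padding
mask = attention_mask.unsqueeze(-1).float()            # [batch, seq_len, 1]
summed = (word_embeddings * mask).sum(dim=1)           # sum of real-token embeddings
counts = mask.sum(dim=1).clamp(min=1e-9)               # number of real tokens per sentence
masked_sentence_embedding = summed / counts            # [batch, hidden size]

print(masked_sentence_embedding.shape)                 # e.g. torch.Size([1, 768])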

