Decode and Encode the Text
Here we decode the token IDs back to text using tokenizer.decode, tokenize that text again with tokenizer.tokenize, and finally encode it with tokenizer.encode.
Python3
# Decode the token IDs back to text
decoded_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Print decoded text
print(f"Decoded Text: {decoded_text}")

# Tokenize the text again for reference
tokenized_text = tokenizer.tokenize(decoded_text)

# Print tokenized text
print(f"Tokenized Text: {tokenized_text}")

# Encode the text
encoded_text = tokenizer.encode(text, return_tensors='pt')  # Returns a tensor

# Print encoded text
print(f"Encoded Text: {encoded_text}")
Output:
Decoded Text: geeksforgeeks is a computer science portal
tokenized Text: ['geek', '##sf', '##org', '##ee', '##ks', 'is', 'a', 'computer', 'science', 'portal']
Encoded Text: tensor([[ 101, 29294, 22747, 21759, 4402, 5705, 2003, 1037, 3274, 2671,
9445, 102]])
The decoded text is the same as the input text, except that it is all lower case, because we used the bert-base-uncased model variant. The encoded text and the input IDs are also identical: tokenizer.encode and tokenizer.batch_encode_plus produce the same sequence of token IDs for a given input text. As discussed previously, BERT can handle out-of-vocabulary words (words not present in its pre-training corpus), here 'geeksforgeeks', by breaking them down into sub-word tokens.
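The sub-word splitting above comes from BERT's WordPiece tokenizer, which repeatedly takes the longest vocabulary entry that matches a prefix of the remaining word, marking word-internal pieces with '##'. The toy sketch below illustrates the greedy longest-match idea only; the tiny hand-picked vocabulary is hypothetical and is not BERT's real ~30,000-entry vocabulary.

```python
# Toy sketch of WordPiece-style greedy longest-match tokenization.
# The vocabulary below is a hypothetical subset, not BERT's real vocab.
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate piece until it appears in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no sub-word covers this span
        tokens.append(match)
        start = end
    return tokens

vocab = {"geek", "##sf", "##org", "##ee", "##ks", "is", "a"}
print(wordpiece_tokenize("geeksforgeeks", vocab))
# ['geek', '##sf', '##org', '##ee', '##ks']
```

With this vocabulary the split matches the tokenized output shown above, which is how an out-of-vocabulary word is still representable without an [UNK] token.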
Extract and Print Word Embeddings
Finally, we extract the generated word embeddings and print them. Word embeddings are contextual: they capture the meaning of each word as it is used in the sentence. We also print the shape of the embedding. Tokens are not printed here since they are not needed; if you wish, you can print them by uncommenting the token-printing line inside the for loop.
Python3
# Print word embeddings for each token
for token, embedding in zip(tokenized_text, word_embeddings[0]):
    # print(f"Token: {token}")
    print(f"Embedding: {embedding}")
    print("\n")
Output:
Embedding: tensor([-2.4299e-01, -2.2849e-01, 5.8441e-02, 5.7861e-03, -4.3398e-01,
-3.4387e-01, 9.6974e-02, 3.6446e-01, -6.3829e-02, -2.3413e-01,
-3.2477e-01, -4.9730e-01, -3.0048e-01, 3.5098e-01, -4.8904e-01,
-1.2836e-01, -5.5042e-01, 4.0802e-02, -3.2041e-01, -1.6057e-01,
................................................
Embedding: tensor([-5.9422e-01, 3.0865e-01, -3.5836e-01, -1.6872e-02, 2.9080e-01,
-5.5942e-01, -2.2233e-01, 7.7186e-01, -8.0256e-01, 2.2205e-01,
-6.1288e-01, -6.0329e-01, -8.2418e-02, 2.8664e-01, -1.1168e+00,
1.1978e+00, 6.1283e-02, -3.9820e-01, 1.1269e-01, -7.9150e-01,
...................................................
The full output is very large, so only a small portion of the embeddings is shown here for illustration: parts of the embeddings of the first and last tokens only.
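A common use of these per-token vectors is comparing them with cosine similarity, which measures how close two embeddings point in the 768-dimensional space. The sketch below uses random placeholder vectors standing in for two rows of word_embeddings[0]; real values would come from the model output.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 768-dimensional vectors standing in for two token
# embeddings; in practice these would be rows of word_embeddings[0].
rng = np.random.default_rng(0)
v1 = rng.standard_normal(768)
v2 = rng.standard_normal(768)

print(f"Similarity(v1, v2): {cosine_similarity(v1, v2):.4f}")
print(f"Similarity(v1, v1): {cosine_similarity(v1, v1):.4f}")  # always 1.0
```

Because BERT embeddings are contextual, the same word in two different sentences generally yields two different vectors, and cosine similarity is a simple way to quantify that difference.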
Printing the Sentence Embedding
We will also generate a sentence embedding by averaging the word embeddings (average pooling).
Python3
# Compute the average of word embeddings to get the sentence embedding
# (average pooling along the sequence-length dimension)
sentence_embedding = word_embeddings.mean(dim=1)

# Print the sentence embedding
print("Sentence Embedding:")
print(sentence_embedding)

# Output the shape of the sentence embedding
print(f"Shape of Sentence Embedding: {sentence_embedding.shape}")
Output:
Sentence Embedding:
tensor([[-1.2731e-01, 2.3766e-01, 1.6280e-01, 1.7505e-01, 2.1393e-01,
-7.2085e-01, -1.1638e-01, 5.5303e-01, -2.4897e-01, -3.5929e-02,
-9.9867e-02, -5.9745e-01, -1.2873e-02, 4.0385e-01, -4.7625e-01,
9.3286e-02, -3.1485e-01, 1.4257e-02, -3.1248e-01, -1.5662e-01,
-1.8107e-01, -2.4591e-01, -9.8347e-02, 5.4759e-01, 1.2483e-01,
.......................................
.......................................
-1.1171e-01, 2.2538e-01, 5.8986e-02]])
Shape of Sentence Embedding: torch.Size([1, 768])
This also produces a large output, along with the shape of the sentence embedding, which is [number of sentences, hidden size].
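One caveat with plain .mean(dim=1): when sentences are padded to a common length, the padding positions are averaged in too. A common refinement is to weight the mean by the attention mask so only real tokens contribute. The sketch below shows the idea in NumPy with tiny illustrative shapes (1 sentence, 4 token positions of which the last is padding, hidden size 3); the real tensors would be the model's word_embeddings and the tokenizer's attention_mask.

```python
import numpy as np

# Illustrative stand-ins: batch of 1 sentence, 4 token positions
# (the last one is padding), hidden size 3.
word_embeddings = np.array([[[1.0, 2.0, 3.0],
                             [3.0, 4.0, 5.0],
                             [5.0, 6.0, 7.0],
                             [9.0, 9.0, 9.0]]])  # padding row
attention_mask = np.array([[1, 1, 1, 0]])        # 0 marks padding

mask = attention_mask[..., None]                 # (1, 4, 1) for broadcasting
summed = (word_embeddings * mask).sum(axis=1)    # sum over real tokens only
counts = mask.sum(axis=1)                        # number of real tokens
sentence_embedding = summed / counts

print(sentence_embedding)  # [[3. 4. 5.]] -- padding row excluded
```

Without the mask, the padding row would pull the average toward its (meaningless) values; with it, the sentence embedding depends only on actual tokens.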