Visualization of Word Embeddings using t-SNE
Visualizing word embeddings can provide insight into how words are positioned relative to one another in the embedding space. The code below demonstrates how to visualize word embeddings using t-SNE (t-distributed Stochastic Neighbor Embedding), a dimensionality-reduction technique, after training a Word2Vec model on the ‘text8’ corpus.
Code Steps:
- Import necessary libraries.
- Load the ‘text8’ corpus.
- Train a Word2Vec model on the corpus.
- Define sample words for visualization.
- Filter words existing in the model’s vocabulary.
- Retrieve word embeddings for sample words.
- Convert embeddings to a numpy array.
- Print original embedding vector shape.
- Use t-SNE to reduce embeddings to 2D.
- Print the shape of reduced embeddings.
- Plot word embeddings using Matplotlib.
- Set plot attributes.
- Save the plot as an image file.
- Display the plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim.downloader as api
from gensim.models import Word2Vec
# Load the text8 corpus from gensim
corpus = api.load('text8')
# Train a Word2Vec model on the text8 corpus
model = Word2Vec(corpus)
# Sample words for visualization
words = ['cat', 'dog', 'elephant', 'lion', 'bird', 'rat', 'wolf', 'cow',
'goat', 'snake', 'rabbit', 'human', 'parrot', 'fox', 'peacock',
'lotus', 'roses', 'marigold', 'jasmine', 'computer', 'robot',
'software', 'vocabulary', 'machine', 'eye', 'vision',
'grammar', 'words', 'sentences', 'language', 'verbs', 'noun',
'transformer', 'embedding', 'neural', 'network', 'optimization']
# Filter words that exist in the model's vocabulary
words = [word for word in words if word in model.wv.key_to_index]
# Get word embeddings for sample words from the pre-trained model
word_embeddings = [model.wv[word] for word in words]
# Convert word embeddings to a numpy array
embeddings = np.array(word_embeddings)
# Print original embedding vector shape
print('Original embedding vector shape', embeddings.shape)
# Use t-SNE to reduce dimensionality to 2D
tsne = TSNE(n_components=2, perplexity=2) # perplexity must be smaller than the number of samples (37 words here)
embeddings_2d = tsne.fit_transform(embeddings)
# Print the shape of the embeddings after applying t-SNE
print('After applying t-SNE, embedding vector shape', embeddings_2d.shape)
# Plot the word embedding graph
# Set figure size and DPI for high-resolution output
plt.figure(figsize=(10, 7), dpi=1000)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], marker='o')
# Add labels to data points
for i, word in enumerate(words):
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1], word,
             fontsize=10, ha='left', va='bottom')  # offset labels from markers for readability
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('Word Embedding Graph (t-SNE with Word2Vec)')
plt.grid(True)
plt.savefig('embedding.png') # Save the plot as an image file
plt.show()
Output:
Original embedding vector shape (37, 100)
After applying t-SNE, embedding vector shape (37, 2)
What are Embeddings in Machine Learning?
In recent years, embeddings have emerged as a core idea in machine learning, revolutionizing the way we represent and understand data. In this article, we delve into the world of embeddings, exploring their importance, applications, and the underlying techniques used to generate them.
Table of Contents
- What are Embeddings?
- Key terms used for embeddings
- Why are embeddings so important?
- What objects can be embedded?
- How do embeddings work?
- Visualization of Word Embeddings using t-SNE
- Frequently Asked Questions on Embeddings