Code Implementation of Negative Sampling for word2vec
1. Importing Necessary Libraries and Defining Hyperparameters and Corpus
This section sets up the initial parameters required for training the Skip-gram model with negative sampling. It also defines a small example corpus consisting of motivational quotes for training purposes.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
# Hyperparameters
embedding_dim = 100
context_size = 2 # Number of context words on each side of the target word
num_negative_samples = 5 # Number of negative samples per positive sample
learning_rate = 0.001
num_epochs = 5
# Example corpus
corpus = [
    "we are what we repeatedly do excellence then is not an act but a habit",
    "the only way to do great work is to love what you do",
    "if you can dream it you can do it",
    "do not wait to strike till the iron is hot but make it hot by striking",
    "whether you think you can or you think you cannot you are right",
]
2. Preprocessing the Corpus
The function preprocess_corpus tokenizes the corpus into individual words and creates a vocabulary from these words. It then maps each word to a unique index and vice versa; these mappings are used for training the model.
# Preprocess the corpus
def preprocess_corpus(corpus):
    # Split each sentence on whitespace and flatten into one token list
    words = [word for sentence in corpus for word in sentence.split()]
    # Sort the vocabulary so that word indices are the same on every run
    vocab = sorted(set(words))
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word

words, word_to_idx, idx_to_word = preprocess_corpus(corpus)
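To see what these mappings look like, the sketch below applies the same idea to a two-sentence toy corpus. The names build_vocab and toy_corpus are illustrative and not part of the code above. The sketch sorts the vocabulary because iterating over a plain set of strings can give a different order from run to run.

```python
def build_vocab(corpus):
    """Tokenize on whitespace and build word <-> index mappings."""
    words = [word for sentence in corpus for word in sentence.split()]
    vocab = sorted(set(words))  # sorted so indices are stable across runs
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word


toy_corpus = ["you can do it", "do it now"]
toy_words, toy_w2i, toy_i2w = build_vocab(toy_corpus)

print(len(toy_words))          # 7 tokens in total
print(toy_w2i)                 # {'can': 0, 'do': 1, 'it': 2, 'now': 3, 'you': 4}
print(toy_i2w[toy_w2i["do"]])  # do
```

The token list keeps duplicates (it is the training stream), while the two dictionaries contain each distinct word exactly once.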
3. Generating Training Data
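The function generate_training_data creates training pairs (target, context) by considering a window of context_size words on each side of every target word. Note that it treats the corpus as one flat token sequence, so windows can span two adjacent quotes, and the first and last context_size tokens are never used as targets. This data will be used to train the Skip-gram model.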
# Generate training data
def generate_training_data(words, word_to_idx, context_size):
    data = []
    # Skip the first and last context_size tokens so every target has a full window
    for i in range(context_size, len(words) - context_size):
        target_word = word_to_idx[words[i]]
        # context_size words to the left, then context_size words to the right
        context_words = [word_to_idx[words[i - j - 1]] for j in range(context_size)]
        context_words += [word_to_idx[words[i + j + 1]] for j in range(context_size)]
        for context_word in context_words:
            data.append((target_word, context_word))
    return data

training_data = generate_training_data(words, word_to_idx, context_size)
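Because generate_training_data works on a single flat token stream, it pairs words across sentence boundaries and never uses the tokens at the very start or end as targets. One possible variant, sketched below under the assumption that each corpus entry should be handled on its own, clamps the window at sentence edges instead. The name generate_pairs_per_sentence and the toy inputs are hypothetical.

```python
def generate_pairs_per_sentence(corpus, word_to_idx, context_size):
    """Build (target, context) index pairs without crossing sentence boundaries."""
    pairs = []
    for sentence in corpus:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            # Clamp the window to the sentence so edge words keep a shorter context
            start = max(0, i - context_size)
            end = min(len(tokens), i + context_size + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((word_to_idx[target], word_to_idx[tokens[j]]))
    return pairs


toy_corpus = ["you can do it"]
toy_w2i = {"can": 0, "do": 1, "it": 2, "you": 3}
pairs = generate_pairs_per_sentence(toy_corpus, toy_w2i, context_size=1)
print(pairs)
# [(3, 0), (0, 3), (0, 1), (1, 0), (1, 2), (2, 1)]
```

With a window of 1, the edge words "you" and "it" each contribute one pair and the interior words contribute two, so nothing in the sentence is dropped.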
4. Custom Dataset Class
A custom PyTorch dataset class, Word2VecDataset, is defined to handle the training data. This class is then wrapped in a DataLoader to facilitate batching and shuffling during training.
# Custom Dataset class
class Word2VecDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = Word2VecDataset(training_data)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
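It can help to look at what the DataLoader actually yields for a dataset of (target, context) tuples. The sketch below uses a hypothetical PairDataset and a handful of made-up index pairs, with shuffling turned off so the batches are predictable. PyTorch's default collation turns the list of tuples into one tensor of targets and one tensor of contexts.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairDataset(Dataset):
    """Minimal dataset over a list of (target, context) index pairs."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


toy_pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
loader = DataLoader(PairDataset(toy_pairs), batch_size=4, shuffle=False)

for target, context in loader:
    print(target, context)
# tensor([0, 1, 1, 2]) tensor([1, 0, 2, 1])
# tensor([2, 3]) tensor([3, 2])
```

The last batch is smaller than batch_size because six pairs do not divide evenly by four; any code consuming the batches should not assume a fixed batch size.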
5. Negative Sampling
The function get_negative_samples generates negative samples for each target word by drawing word indices uniformly at random and rejecting any draw equal to the target itself. These samples teach the Skip-gram model which words should not be predicted as context for a given target.
# Negative Sampling
def get_negative_samples(target, num_negative_samples, vocab_size):
    neg_samples = []
    while len(neg_samples) < num_negative_samples:
        neg_sample = np.random.randint(0, vocab_size)
        if neg_sample != target:
            neg_samples.append(neg_sample)
    return neg_samples
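get_negative_samples draws every word with equal probability. The original word2vec work instead samples negatives from the unigram distribution raised to the 3/4 power, so frequent words are picked more often but less than proportionally. The sketch below shows one way this could look; the function names, the toy text, and the extra step of also rejecting the true context word are assumptions of this example, not part of the code above.

```python
import numpy as np
from collections import Counter


def build_noise_distribution(words, word_to_idx, power=0.75):
    """Return sampling probabilities proportional to count(word) ** power."""
    counts = Counter(words)
    freqs = np.zeros(len(word_to_idx), dtype=np.float64)
    for word, idx in word_to_idx.items():
        freqs[idx] = counts[word]
    weights = freqs ** power
    return weights / weights.sum()


def sample_negatives(target, context, num_negative_samples, noise_dist, rng):
    """Draw negatives from noise_dist, rejecting the target and the true context."""
    negatives = []
    while len(negatives) < num_negative_samples:
        candidate = int(rng.choice(len(noise_dist), p=noise_dist))
        if candidate != target and candidate != context:
            negatives.append(candidate)
    return negatives


toy_words = "you can do it if you can dream it".split()
toy_w2i = {w: i for i, w in enumerate(sorted(set(toy_words)))}
noise_dist = build_noise_distribution(toy_words, toy_w2i)
rng = np.random.default_rng(0)
print(sample_negatives(toy_w2i["do"], toy_w2i["it"], 3, noise_dist, rng))
```

On a corpus as small as the one in this article the difference from uniform sampling is minor; the smoothing matters more when word frequencies span several orders of magnitude.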
6. Skip-gram Model with Negative Sampling
A PyTorch neural network model, SkipGramNegSampling, is defined to implement the Skip-gram model with negative sampling. This model includes embeddings for both target and context words and calculates the loss using log-sigmoid functions.
# Skip-gram Model with Negative Sampling
class SkipGramNegSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.vocab_size = vocab_size
        # Separate tables: one for words as targets, one for words as context/negatives
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, target, context, negative_samples):
        target_embedding = self.embeddings(target)                       # (batch, dim)
        context_embedding = self.context_embeddings(context)             # (batch, dim)
        negative_embeddings = self.context_embeddings(negative_samples)  # (batch, num_neg, dim)
        # Log-sigmoid of the target-context dot product, one value per example
        positive_score = self.log_sigmoid(torch.sum(target_embedding * context_embedding, dim=1))
        # Log-sigmoid of the negated target-negative dot products, summed over the negatives
        negative_score = self.log_sigmoid(-torch.bmm(negative_embeddings, target_embedding.unsqueeze(2)).squeeze(2)).sum(1)
        loss = -(positive_score + negative_score).mean()
        return loss
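The batched matrix product in forward is compact but easy to misread. As a sanity check, the sketch below recomputes the same loss on random tensors with an explicit loop over examples and negatives and compares the two results. The tensor names and sizes are made up for illustration, and random tensors stand in for the embedding lookups.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_neg, dim = 4, 5, 8
target_emb = torch.randn(batch_size, dim)             # (B, D)
context_emb = torch.randn(batch_size, dim)            # (B, D)
negative_emb = torch.randn(batch_size, num_neg, dim)  # (B, K, D)

# Batched form: (B, K, D) x (B, D, 1) -> (B, K, 1) -> (B, K)
neg_dots = torch.bmm(negative_emb, target_emb.unsqueeze(2)).squeeze(2)
positive_score = F.logsigmoid((target_emb * context_emb).sum(dim=1))  # (B,)
negative_score = F.logsigmoid(-neg_dots).sum(dim=1)                   # (B,)
loss = -(positive_score + negative_score).mean()

# Explicit loop over examples and negatives, for comparison
loop_loss = torch.tensor(0.0)
for b in range(batch_size):
    term = F.logsigmoid(torch.dot(target_emb[b], context_emb[b]))
    for k in range(num_neg):
        term = term + F.logsigmoid(-torch.dot(negative_emb[b, k], target_emb[b]))
    loop_loss = loop_loss - term / batch_size

print(neg_dots.shape)                               # torch.Size([4, 5])
print(torch.allclose(loss, loop_loss, atol=1e-5))   # True
```

Each example contributes one positive term and num_neg negative terms, and the final loss is the negated mean of those per-example sums.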
7. Training the Model
This section initializes the model and optimizer and then trains the model over several epochs. During each epoch, it processes the training data, computes the loss, and updates the model parameters to minimize the loss.
# Training the model
vocab_size = len(word_to_idx)
model = SkipGramNegSampling(vocab_size, embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    total_loss = 0
    for target, context in dataloader:
        target = target.long()
        context = context.long()
        # One list of negative indices per target in the batch: shape (batch, num_negative_samples)
        negative_samples = torch.LongTensor([get_negative_samples(t.item(), num_negative_samples, vocab_size) for t in target])
        optimizer.zero_grad()
        loss = model(target, context, negative_samples)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Average loss per batch for this epoch
    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}")
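The training loop builds the negatives with a Python list comprehension that calls get_negative_samples once per target, which becomes slow as batches grow. One possible vectorized alternative is sketched below: it draws from vocab_size - 1 slots and shifts any draw at or above the target up by one, which gives a uniform sample over all words except the target without a rejection loop. The function name sample_negatives_batch is hypothetical.

```python
import torch


def sample_negatives_batch(target, num_negative_samples, vocab_size):
    """Uniformly sample negatives for a whole batch, never returning the target."""
    # Draw from vocab_size - 1 slots, then shift draws at or above the target
    # up by one so that the target index itself can never be produced.
    draws = torch.randint(0, vocab_size - 1, (target.size(0), num_negative_samples))
    return draws + (draws >= target.unsqueeze(1)).long()


torch.manual_seed(0)
target = torch.tensor([0, 3, 9])
negatives = sample_negatives_batch(target, num_negative_samples=5, vocab_size=10)
print(negatives.shape)                                  # torch.Size([3, 5])
print((negatives != target.unsqueeze(1)).all().item())  # True
```

The result already has the (batch, num_negative_samples) shape and integer dtype that the model's forward method expects, so it could replace the list comprehension directly.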
8. Getting Word Embeddings and Finding Similar Words
After training, the word embeddings are extracted from the model. A function get_similar_words is defined to find words with similar embeddings to a given word, based on cosine similarity. The code then demonstrates how to find similar words for the word “do”.
# Getting the word embeddings
embeddings = model.embeddings.weight.detach().numpy()

# Function to get similar words
def get_similar_words(word, top_n=5):
    idx = word_to_idx[word]
    # Normalize each row to unit length so the dot product is a cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = np.dot(normed, normed[idx])
    # Rank by descending similarity and drop the query word itself
    closest_idxs = [i for i in (-similarities).argsort() if i != idx][:top_n]
    return [idx_to_word[i] for i in closest_idxs]

# Example usage
print(get_similar_words("do"))
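Output (results differ between runs, since the embeddings are randomly initialized and the negatives are randomly sampled):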
['dream', 'right', 'hot', 'if', 'strike']
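Cosine similarity and the raw dot product can rank neighbours differently when embedding norms vary, because the dot product also rewards vectors that are simply long. The toy sketch below, with three made-up 2-D vectors, illustrates why normalization matters when looking up similar words.

```python
import numpy as np

emb = np.array([
    [1.0, 0.0],  # query vector
    [0.9, 0.1],  # almost the same direction as the query
    [5.0, 5.0],  # different direction, but a much larger norm
])
query = emb[0]

dot = emb @ query
cos = dot / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))

dot_rank = (-dot).argsort()
cos_rank = (-cos).argsort()
print(dot_rank.tolist())  # [2, 0, 1]
print(cos_rank.tolist())  # [0, 1, 2]
```

Under the dot product the long vector outranks even the query itself, so skipping "the first result" would drop the wrong entry. Under cosine similarity the query comes first and the near-parallel vector second.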
Advantages of Negative Sampling
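- Computational Efficiency: Each training pair updates only the vectors of the target word, the true context word, and a handful of sampled negatives, rather than the output vector of every word in the vocabulary as a full softmax would. This makes the training of large-scale models feasible.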
- Scalability: It enables the training of word embeddings on very large corpora with extensive vocabularies.
- Improved Performance: Negative sampling often leads to better word embeddings by focusing on distinguishing true context pairs from random pairs, which helps in capturing the semantic relationships more effectively.
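The efficiency point can be made concrete by counting how many rows of the output embedding table receive a non-zero gradient for a single training pair. The sketch below compares a full softmax with the negative-sampling loss on randomly initialized embeddings. The vocabulary size, dimensions, and word indices are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 1000, 16
in_emb = nn.Embedding(vocab_size, dim)
out_emb = nn.Embedding(vocab_size, dim)

target = torch.tensor([3])
context = torch.tensor([7])
negatives = torch.tensor([[10, 20, 30, 40, 50]])  # 5 sampled negatives

# Full softmax: the target vector is scored against every output vector
logits = in_emb(target) @ out_emb.weight.T  # (1, vocab_size)
F.cross_entropy(logits, context).backward()
rows_softmax = int((out_emb.weight.grad.abs().sum(dim=1) > 0).sum())

in_emb.zero_grad()
out_emb.zero_grad()

# Negative sampling: only the context word and the negatives are scored
v_t = in_emb(target)                                       # (1, dim)
pos = F.logsigmoid((v_t * out_emb(context)).sum(dim=1))    # (1,)
neg_dots = (out_emb(negatives) @ v_t.unsqueeze(2)).squeeze(2)  # (1, 5)
neg = F.logsigmoid(-neg_dots).sum(dim=1)                   # (1,)
(-(pos + neg).mean()).backward()
rows_neg = int((out_emb.weight.grad.abs().sum(dim=1) > 0).sum())

print(rows_softmax)  # 1000
print(rows_neg)      # 6
```

The softmax touches every row because its normalizer depends on all words, while negative sampling touches one row for the true context plus one per negative, independent of vocabulary size.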
Negative Sampling Using word2vec
Word2Vec, developed by Tomas Mikolov and colleagues at Google, has revolutionized natural language processing by transforming words into meaningful vector representations. Among the key innovations that made Word2Vec both efficient and effective is the technique of negative sampling. This article delves into what negative sampling is, why it’s crucial, and how it works within the Word2Vec framework.