Implementing Topic Modeling using Latent Dirichlet Allocation
Step 1: Install Necessary Libraries
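This step installs the libraries required for text processing and topic modeling: pandas, gensim, spacy, nltk, and matplotlib. Step 6 also loads spaCy's en_core_web_sm model, which has to be downloaded separately.
!pip install pandas gensim spacy nltk matplotlib
!python -m spacy download en_core_web_sm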
Step 2: Create and Save Sample Dataset
In this step, we create a sample dataset containing a text column and save it to a CSV file. The sample dataset consists of a list of 10 text entries, each containing a short sentence.
import pandas as pd
# Create a sample dataset
data = {
    'text_column': [
        'The cat sat on the mat.',
        'Dogs are great pets.',
        'I love to play football.',
        'Data science is an interdisciplinary field.',
        'Python is a great programming language.',
        'Machine learning is a subset of artificial intelligence.',
        'Artificial intelligence and machine learning are popular topics.',
        'Deep learning is a type of machine learning.',
        'Natural language processing involves analyzing text data.',
        'I enjoy hiking and outdoor activities.'
    ]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Save DataFrame to CSV
df.to_csv('sample_dataset.csv', index=False)
Step 3: Load Dataset
Load the sample dataset from the CSV file into a DataFrame.
import pandas as pd
# Load data
data = pd.read_csv('sample_dataset.csv')
Step 4: Preprocess Text Data
This step involves cleaning the text data by removing extra spaces, emails, apostrophes, and non-alphabet characters, and converting the text to lowercase.
import re
# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove emails
    text = re.sub(r"'", '', text)  # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-alphabet characters with spaces
    text = text.lower()  # Convert to lowercase
    return text
data['cleaned_text'] = data['text_column'].apply(preprocess_text)
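To see what the cleaning step does to a messier input than the sample sentences, here is a small check on a made-up string (the email address and wording are hypothetical). It repeats the function so the snippet runs on its own.

```python
import re

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)         # Collapse runs of whitespace
    text = re.sub(r'\S*@\S*\s?', '', text)   # Remove emails
    text = re.sub(r"'", '', text)            # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # Replace non-letters with spaces
    return text.lower()

# Hypothetical input, not part of the sample dataset
raw = "Contact  me at jane.doe@example.com, it's   URGENT!"
cleaned = preprocess_text(raw)
print(repr(cleaned))
```

Note that punctuation is replaced by a space rather than deleted, so the result can end with a stray space. That is harmless here because the tokenizer in the next step splits on whitespace.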
Step 5: Tokenize and Remove Stopwords
Tokenize the cleaned text data and remove stopwords using NLTK.
import gensim
import nltk
from nltk.corpus import stopwords
# Download NLTK stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
# Tokenize and remove stopwords
def tokenize(text):
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
data['tokens'] = data['cleaned_text'].apply(tokenize)
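The effect of this step can be sketched without any downloads. The snippet below is only a rough stand-in: the regex tokenizer approximates simple_preprocess (which lowercases the text and, by default, keeps tokens of 2 to 15 characters), and STOP_WORDS is a deliberately tiny, hypothetical list rather than NLTK's full English list.

```python
import re

# Hypothetical mini stopword set; the article uses NLTK's English list instead.
STOP_WORDS = {"the", "on", "is", "a", "an", "of", "are", "and", "to", "i"}

def simple_tokenize(text, min_len=2, max_len=15):
    # Lowercase, pull out alphabetic runs, then filter by length and stopwords
    candidates = re.findall(r"[a-z]+", text.lower())
    return [t for t in candidates
            if min_len <= len(t) <= max_len and t not in STOP_WORDS]

tokens = simple_tokenize("the cat sat on the mat ")
print(tokens)
```

Storing the stopwords in a set keeps each membership test constant-time, which matters more on a real corpus than on ten sentences.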
Step 6: Lemmatize Tokens
Lemmatize the tokens using spaCy.
import spacy
# Load spaCy model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize(tokens):
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]
data['lemmas'] = data['tokens'].apply(lemmatize)
Step 7: Create Dictionary and Corpus
Create a dictionary and corpus from the lemmatized tokens.
import gensim.corpora as corpora
# Create dictionary and corpus
id2word = corpora.Dictionary(data['lemmas'])
texts = data['lemmas']
corpus = [id2word.doc2bow(text) for text in texts]
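Conceptually, the dictionary maps each distinct token to an integer id, and each document becomes a list of (token id, count) pairs. The plain-Python sketch below illustrates that representation on two made-up documents; it is not gensim's implementation, and the ids gensim assigns may differ from the first-seen order used here.

```python
from collections import Counter

# Hypothetical lemmatized documents
docs = [["cat", "sit", "mat"], ["machine", "learning", "machine"]]

# Assign ids in first-seen order (an assumption of this sketch)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count occurrences per token id and return sorted (id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Word order inside a document is discarded at this point, which is the bag-of-words assumption LDA relies on.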
Step 8: Build LDA Model
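Build an LDA model with three topics (num_topics=3). Here random_state fixes the seed so results are reproducible, passes=10 sets the number of training passes over the corpus, and alpha='auto' lets gensim learn an asymmetric document-topic prior from the data.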
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
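To make the roles of the number of topics and the alpha prior more concrete, the generative story that LDA assumes can be sketched in plain Python. This illustrates the model's assumptions, not how gensim fits it; the three toy topics, the alpha values, and the function names are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a vector of normalized Gamma draws
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    # 1. Draw the document's topic mixture theta from Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities
topics = [
    {"machine": 0.5, "learning": 0.5},
    {"cat": 0.5, "mat": 0.5},
    {"football": 0.5, "hiking": 0.5},
]
rng = random.Random(100)
theta, words = generate_document(topics, alpha=[0.5, 0.5, 0.5], length=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Training runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture. Smaller alpha values tend to concentrate a document on fewer topics.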
Step 9: Print Topics
Print the topics generated by the LDA model.
# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
Step 10: Compute Coherence Score
Compute the coherence score to evaluate the quality of the topics.
from gensim.models import CoherenceModel
# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data['lemmas'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
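The c_v measure is fairly involved, but the intuition behind coherence can be shown with a simpler, UMass-style score, which is a different measure from the one computed above. It rewards topics whose top words tend to appear in the same documents. The function name, the toy documents, and the word lists below are assumptions for illustration, and the sketch requires every top word to occur in at least one document.

```python
import math

def umass_style_coherence(top_words, documents):
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # For each word, compare it with every higher-ranked word in the topic
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score

# Hypothetical tokenized documents
documents = [
    ["machine", "learning", "ai"],
    ["machine", "learning"],
    ["cat", "mat"],
]
coherent = umass_style_coherence(["machine", "learning", "ai"], documents)
mixed = umass_style_coherence(["machine", "cat"], documents)
print(coherent, mixed)
```

The topic whose words co-occur gets the higher score, while the topic mixing unrelated words is penalized because "machine" and "cat" never share a document.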
Complete implementation of Topic Modeling using LDA
import re
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
# Create a sample dataset
data = {
    'text_column': [
        'The cat sat on the mat.',
        'Dogs are great pets.',
        'I love to play football.',
        'Data science is an interdisciplinary field.',
        'Python is a great programming language.',
        'Machine learning is a subset of artificial intelligence.',
        'Artificial intelligence and machine learning are popular topics.',
        'Deep learning is a type of machine learning.',
        'Natural language processing involves analyzing text data.',
        'I enjoy hiking and outdoor activities.'
    ]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Save DataFrame to CSV
df.to_csv('sample_dataset.csv', index=False)
# Download NLTK stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
# Load data
data = pd.read_csv('sample_dataset.csv')  # Load the sample dataset
# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove emails
    text = re.sub(r"'", '', text)  # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-alphabet characters with spaces
    text = text.lower()  # Convert to lowercase
    return text
data['cleaned_text'] = data['text_column'].apply(preprocess_text)  # Replace 'text_column' with your column name
# Tokenize and remove stopwords
def tokenize(text):
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
data['tokens'] = data['cleaned_text'].apply(tokenize)
# Lemmatization using spaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize(tokens):
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]
data['lemmas'] = data['tokens'].apply(lemmatize)
# Create dictionary and corpus
id2word = corpora.Dictionary(data['lemmas'])
texts = data['lemmas']
corpus = [id2word.doc2bow(text) for text in texts]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data['lemmas'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Output:
(0, '0.115*"learning" + 0.066*"type" + 0.066*"programming" + 0.066*"python" + 0.066*"deep" + 0.066*"great" + 0.065*"language" + 0.065*"machine" + 0.016*"datum" + 0.016*"love"')
(1, '0.062*"outdoor" + 0.062*"activity" + 0.062*"football" + 0.062*"enjoy" + 0.062*"cat" + 0.062*"play" + 0.062*"hike" + 0.062*"mat" + 0.062*"sit" + 0.062*"love"')
(2, '0.066*"machine" + 0.066*"datum" + 0.066*"artificial" + 0.066*"intelligence" + 0.038*"language" + 0.038*"great" + 0.038*"learning" + 0.038*"popular" + 0.038*"learn" + 0.038*"processing"')
Coherence Score: 0.5839748062472863
The output shows three topics, each listed as its ten most probable words together with the probability of each word under that topic. The coherence score of about 0.58 is an aggregate measure of how strongly a topic's top words are associated with one another in the corpus; higher scores generally indicate more interpretable topics.
For the c_v measure used here, scores typically fall between 0 and 1, so 0.58 suggests reasonably coherent topics with room for improvement. With only ten short sentences, though, the topics overlap ("machine", "learning", and "language" appear in both topic 0 and topic 2), so the number is best read as illustrative rather than as a reliable estimate of topic quality.
Topic Modeling Using Latent Dirichlet Allocation (LDA)
In the era of information explosion, extracting meaningful insights from large collections of text data has become increasingly important. Topic modeling is a powerful technique for uncovering hidden themes or topics within a corpus of documents. Among the various methods available, Latent Dirichlet Allocation (LDA) stands out as one of the most popular and effective algorithms for topic modeling.
This article delves into what LDA is, the fundamentals of topic modeling, and its applications, and concludes with a summary of its significance.