Implementing Topic Modeling using Latent Dirichlet Allocation

Step 1: Install Necessary Libraries

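This step installs the libraries used for text processing and topic modeling: pandas, gensim, spaCy, NLTK, and matplotlib. It also downloads the small English spaCy model (en_core_web_sm) that Step 6 loads.

!pip install pandas gensim spacy nltk matplotlib
!python -m spacy download en_core_web_sm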

Step 2: Create and Save Sample Dataset

In this step, we create a sample dataset containing a text column and save it to a CSV file. The sample dataset consists of a list of 10 text entries, each containing a short sentence.

import pandas as pd

# Create a sample dataset
data = {
    'text_column': [
        'The cat sat on the mat.',
        'Dogs are great pets.',
        'I love to play football.',
        'Data science is an interdisciplinary field.',
        'Python is a great programming language.',
        'Machine learning is a subset of artificial intelligence.',
        'Artificial intelligence and machine learning are popular topics.',
        'Deep learning is a type of machine learning.',
        'Natural language processing involves analyzing text data.',
        'I enjoy hiking and outdoor activities.'
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('sample_dataset.csv', index=False)

Step 3: Load Dataset

Load the sample dataset from the CSV file into a DataFrame.

import pandas as pd

# Load data
data = pd.read_csv('sample_dataset.csv')

Step 4: Preprocess Text Data

This step involves cleaning the text data by removing extra spaces, emails, apostrophes, and non-alphabet characters, and converting the text to lowercase.

import re

# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove emails
    text = re.sub(r"'", '', text)  # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-alphabet characters with spaces
    text = text.lower()  # Convert to lowercase
    return text

data['cleaned_text'] = data['text_column'].apply(preprocess_text)
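
To see what each substitution does, the same function can be run on a single string. This is a minimal sketch; the sample sentence and email address are made up for illustration and are not part of the dataset.

```python
import re

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop anything that looks like an email
    text = re.sub(r"'", '', text)           # drop apostrophes ("it's" -> "its")
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # punctuation and digits become spaces
    return text.lower()

# Hypothetical input, chosen to exercise every rule above
sample = "Email me at  jane.doe@example.com, it's   URGENT!"
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'email me at its urgent '
```

Note the trailing space left behind by the exclamation mark: punctuation is replaced rather than deleted, so stray spaces remain. The tokenizer in the next step splits on word boundaries, so they do no harm.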

Step 5: Tokenize and Remove Stopwords

Tokenize the cleaned text data and remove stopwords using NLTK.

import gensim
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # A set makes membership tests fast

# Tokenize and remove stopwords
def tokenize(text):
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

data['tokens'] = data['cleaned_text'].apply(tokenize)
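
The effect of this step can be illustrated without any downloads. The sketch below is a standard-library stand-in, not gensim's or NLTK's implementation: the regular expression only approximates simple_preprocess (which by default keeps lowercase tokens of at least two characters), and STOP_WORDS is a hypothetical miniature list rather than the NLTK one.

```python
import re

# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {"the", "on", "is", "a", "an", "of", "and", "are", "to", "i"}

def tokenize(text):
    # Keep lowercase alphabetic tokens of length >= 2, then drop stop words
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("the cat sat on the mat "))
# ['cat', 'sat', 'mat']
print(tokenize("machine learning is a subset of artificial intelligence "))
# ['machine', 'learning', 'subset', 'artificial', 'intelligence']
```

Only the content-bearing words survive, which is what the topic model needs: stop words occur in almost every document and would otherwise dominate every topic.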

Step 6: Lemmatize Tokens

Lemmatize the tokens using spaCy.

import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize(tokens):
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]

data['lemmas'] = data['tokens'].apply(lemmatize)

Step 7: Create Dictionary and Corpus

Create a dictionary and corpus from the lemmatized tokens.

import gensim.corpora as corpora

# Create dictionary and corpus
id2word = corpora.Dictionary(data['lemmas'])
texts = data['lemmas']
corpus = [id2word.doc2bow(text) for text in texts]
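
The dictionary maps each distinct token to an integer id, and each document becomes a bag of words: a list of (token_id, count) pairs. The sketch below mimics that idea with the standard library only. It assumes ids are assigned in order of first appearance, which may differ from the ids gensim assigns, and the two short documents are made up for illustration.

```python
from collections import Counter

docs = [
    ["machine", "learning", "subset", "artificial", "intelligence"],
    ["deep", "learning", "type", "machine", "learning"],
]

# token -> integer id, in order of first appearance (an assumption of this sketch)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow(doc):
    # Count occurrences per token id and return sorted (id, count) pairs
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]
print(token2id)
# {'machine': 0, 'learning': 1, 'subset': 2, 'artificial': 3, 'intelligence': 4, 'deep': 5, 'type': 6}
print(corpus)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(0, 1), (1, 2), (5, 1), (6, 1)]]
```

Word order is discarded: the second document records only that "learning" (id 1) occurs twice. LDA works entirely on these counts.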

Step 8: Build LDA Model

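Build an LDA model with three topics (num_topics=3). Setting random_state makes the run reproducible, passes=10 makes ten sweeps over the corpus during training, and alpha='auto' lets gensim learn the document-topic prior from the data.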

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3, 
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

Step 9: Print Topics

Print the topics generated by the LDA model.

# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
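
Each printed topic is a tuple of a topic id and a string of the form weight*"word" + weight*"word" + .... If the weights are needed as numbers, that string can be parsed as sketched below; the sample string is shortened from the output shown later in this article, and parse_topic is a hypothetical helper name. In gensim itself, lda_model.show_topic(topic_id, topn=10) returns (word, weight) pairs directly, so parsing is only useful when all you have is the printed text.

```python
import re

# Shortened example of one topic string as produced by print_topics
topic_string = '0.115*"learning" + 0.066*"type" + 0.066*"programming"'

def parse_topic(s):
    # Extract every weight*"word" term and return (word, weight) pairs
    return [(word, float(weight))
            for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', s)]

pairs = parse_topic(topic_string)
print(pairs)
# [('learning', 0.115), ('type', 0.066), ('programming', 0.066)]
```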

Step 10: Compute Coherence Score

Compute the coherence score to evaluate the quality of the topics.

from gensim.models import CoherenceModel

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data['lemmas'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
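
Coherence measures reward topics whose top words tend to occur in the same documents. The c_v measure used above is fairly involved, so the sketch below illustrates the idea with a different and simpler measure: an averaged UMass-style score based on document co-occurrence counts. It is not what gensim computes for c_v (gensim offers its own UMass implementation via coherence='u_mass'), and the documents and word lists are made up for illustration.

```python
import math

docs = [
    ["machine", "learning", "artificial", "intelligence"],
    ["deep", "learning", "machine", "learning"],
    ["cat", "sit", "mat"],
    ["artificial", "intelligence", "machine", "learning"],
]

def doc_freq(words, docs):
    """Number of documents that contain every word in `words`."""
    return sum(1 for d in docs if all(w in d for w in words))

def umass_coherence(top_words, docs):
    """Average of log((D(wi, wj) + 1) / D(wj)) over pairs with j < i.

    Assumes every word in `top_words` occurs in at least one document.
    """
    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            joint = doc_freq([wi, wj], docs)
            scores.append(math.log((joint + 1) / doc_freq([wj], docs)))
    return sum(scores) / len(scores)

coherent = ["machine", "learning", "intelligence"]  # words that co-occur
mixed = ["machine", "cat", "mat"]                   # words from unrelated documents

print(umass_coherence(coherent, docs))
print(umass_coherence(mixed, docs))
```

The first list scores higher than the second because its words share documents, while "machine" never appears alongside "cat" or "mat". UMass scores are log ratios and are usually negative, so they are not on the same scale as c_v.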

Complete implementation of Topic Modeling using LDA

import pandas as pd

# Create a sample dataset
data = {
    'text_column': [
        'The cat sat on the mat.',
        'Dogs are great pets.',
        'I love to play football.',
        'Data science is an interdisciplinary field.',
        'Python is a great programming language.',
        'Machine learning is a subset of artificial intelligence.',
        'Artificial intelligence and machine learning are popular topics.',
        'Deep learning is a type of machine learning.',
        'Natural language processing involves analyzing text data.',
        'I enjoy hiking and outdoor activities.'
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('sample_dataset.csv', index=False)

import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
import nltk
from nltk.corpus import stopwords
import re

# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # A set makes membership tests fast

# Load data
data = pd.read_csv('sample_dataset.csv')  # Load the sample dataset

# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove emails
    text = re.sub(r"'", '', text)  # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-alphabet characters with spaces
    text = text.lower()  # Convert to lowercase
    return text

data['cleaned_text'] = data['text_column'].apply(preprocess_text)  # Replace 'text_column' with your column name

# Tokenize and remove stopwords
def tokenize(text):
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

data['tokens'] = data['cleaned_text'].apply(tokenize)

# Lemmatization using spaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize(tokens):
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]

data['lemmas'] = data['tokens'].apply(lemmatize)

# Create dictionary and corpus
id2word = corpora.Dictionary(data['lemmas'])
texts = data['lemmas']
corpus = [id2word.doc2bow(text) for text in texts]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3, 
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data['lemmas'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Output:

(0, '0.115*"learning" + 0.066*"type" + 0.066*"programming" + 0.066*"python" + 0.066*"deep" + 0.066*"great" + 0.065*"language" + 0.065*"machine" + 0.016*"datum" + 0.016*"love"')
(1, '0.062*"outdoor" + 0.062*"activity" + 0.062*"football" + 0.062*"enjoy" + 0.062*"cat" + 0.062*"play" + 0.062*"hike" + 0.062*"mat" + 0.062*"sit" + 0.062*"love"')
(2, '0.066*"machine" + 0.066*"datum" + 0.066*"artificial" + 0.066*"intelligence" + 0.038*"language" + 0.038*"great" + 0.038*"learning" + 0.038*"popular" + 0.038*"learn" + 0.038*"processing"')

Coherence Score:  0.5839748062472863

The output shows three topics, each represented by its ten highest-weighted words. A word's weight indicates how strongly it is associated with that topic. The token "datum" appears because the lemmatizer reduced "data" to its singular form.

The coherence score of about 0.584 measures how interpretable the topics are. Scores for the c_v measure typically fall between 0 and 1, and higher values generally indicate more coherent topics. A score in this range suggests reasonably coherent topics with room for improvement, but with only 10 short sentences in the corpus it should be read as illustrative rather than as a reliable measure of topic quality.

Topic Modeling Using Latent Dirichlet Allocation (LDA)

In the era of information explosion, extracting meaningful insights from large collections of text data has become increasingly important. Topic modeling is a powerful technique for uncovering hidden themes or topics within a corpus of documents. Among the various methods available, Latent Dirichlet Allocation (LDA) stands out as one of the most popular and effective algorithms for topic modeling.

This article delves into what LDA is, the fundamentals of topic modeling, and its applications, and concludes with a summary of its significance.
