Implementation to classify text documents using Naive Bayes

Importing Libraries

Python3

#importing libraries
import prettytable

                    

The snippet imports the “prettytable” library, which renders tabular data as a formatted text table. It is used later in this implementation to print the per-document word counts.

Classification using Naive Bayes

Python3

print('\n *-----* Classification using Naïve bayes *-----* \n')
total_documents = int(input("Enter the Total Number of documents: "))

# Each entry of doc_class is [list of tokens, class label]
doc_class = []
keywords = []
for i in range(total_documents):
    text = input(f"\nEnter the text of Doc-{i+1} : ").lower()
    clas = input(f"Enter the class of Doc-{i+1} : ")
    doc_class.append([text.split(), clas])
    keywords.extend(text.split())

# Sorted list of the unique tokens across all documents (the vocabulary)
keywords = sorted(set(keywords))

to_find = input(
    "\nEnter the Text to classify using Naive Bayes: ").lower().split()

# probability_table[i][k] holds how often keywords[k] occurs in document i
probability_table = []
for i in range(total_documents):
    probability_table.append([doc_class[i][0].count(k) for k in keywords])
print('\n')

                    

Output:

 *-----* Classification using Naïve bayes *-----* 
Enter the Total Number of documents: 3
Enter the text of Doc-1 : I watched the movie.
Enter the class of Doc-1 : +
Enter the text of Doc-2 : I hated the movie.
Enter the class of Doc-2 : -
Enter the text of Doc-3 : poor acting.
Enter the class of Doc-3 : +
Enter the Text to classify using Naive Bayes: I hated the acting.


This code collects the training data and builds the count table. It prompts for the number of documents, then reads the text and class label of each one. Each text is lowercased and split on whitespace, so punctuation stays attached to the tokens (for example, “movie.” and “acting.”). The unique tokens across all documents form the sorted keyword list, and probability_table stores, for each document, how many times each keyword occurs in it. Despite its name, the table holds raw counts at this stage; the class probabilities are computed in the later steps. The text to classify is read and tokenized in the same way.
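The counting step can also be tried without typing input at a prompt. The following is a minimal sketch, assuming the three sample documents from the run above; build_count_table is a hypothetical helper name and is not part of the article's code.

```python
# Hypothetical helper (not from the article): builds the vocabulary and the
# per-document count table in one call, using the same lowercase + split rule.
def build_count_table(docs):
    """Return (vocabulary, table); table[i][k] counts vocabulary[k] in docs[i]."""
    tokenized = [text.lower().split() for text in docs]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    table = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, table


docs = ["I watched the movie.", "I hated the movie.", "poor acting."]
vocabulary, table = build_count_table(docs)
print(vocabulary)
for row in table:
    print(row)
```

Because the split is on whitespace only, “movie.” keeps its trailing period, which is why the tokens in the table carry punctuation.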

Probability of Documents

Python3

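import prettytable

# Add header labels for the ID and class columns
keywords.insert(0, 'Document ID')
keywords.append("Class")
Prob_Table = prettytable.PrettyTable()
Prob_Table.field_names = keywords
Prob_Table.title = 'Probability of Documents'
# Prepend the document ID and append the class label to each row
for x, row in enumerate(probability_table):
    row.insert(0, x+1)
    row.append(doc_class[x][1])
    Prob_Table.add_row(row)
print(Prob_Table)
print('\n')

# Drop the document ID again; the class label stays as the last element
for row in probability_table:
    row.pop(0)

totalpluswords = 0  # total word count over '+' documents
totalnegwords = 0   # total word count over '-' documents
totalplus = 0       # number of '+' documents
totalneg = 0        # number of '-' documents
vocabulary = len(keywords)-2  # exclude the two header labels
for row in probability_table:
    if row[-1] == "+":
        totalplus += 1
        totalpluswords += sum(row[:-1])
    else:
        totalneg += 1
        totalnegwords += sum(row[:-1])

# Restore keywords to the plain vocabulary list
keywords.pop(0)
keywords.pop()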

                    

Output:

+---------------------------------------------------------------------------+
|                         Probability of Documents                          |
+-------------+---------+-------+---+--------+------+-----+---------+-------+
| Document ID | acting. | hated | i | movie. | poor | the | watched | Class |
+-------------+---------+-------+---+--------+------+-----+---------+-------+
|      1      |    0    |   0   | 1 |   1    |  0   |  1  |    1    |   +   |
|      2      |    0    |   1   | 1 |   1    |  0   |  1  |    0    |   -   |
|      3      |    1    |   0   | 0 |   0    |  1   |  0  |    0    |   +   |
+-------------+---------+-------+---+--------+------+-----+---------+-------+


This code prints the count table with the “prettytable” package. It temporarily adds ‘Document ID’ at the start and ‘Class’ at the end of the keyword list, uses that list as the field names of a PrettyTable object, and sets a title. Each row of probability_table is extended with its document ID and class label and added to the table. After printing, the document IDs are removed again, while the class label stays as the last element of each row. The code then counts, for each class (‘+’ and ‘-’), the number of documents and the total number of words. Finally, it sets the vocabulary size to the number of keywords without the two header labels and removes those labels from the keyword list.
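The per-class totals described above can be checked in isolation. The sketch below assumes the count rows and labels from the sample table; class_statistics is a hypothetical helper name, and it handles any set of labels rather than only ‘+’ and ‘-’.

```python
# Hypothetical helper (not from the article): per-class document and word totals.
def class_statistics(table, labels):
    """Return {label: (document_count, total_word_count)}."""
    stats = {}
    for row, label in zip(table, labels):
        n_docs, n_words = stats.get(label, (0, 0))
        stats[label] = (n_docs + 1, n_words + sum(row))
    return stats


# Count rows and class labels taken from the sample table
table = [
    [0, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0],
]
labels = ["+", "-", "+"]
stats = class_statistics(table, labels)
print(stats)  # {'+': (2, 6), '-': (1, 4)}
```

For the sample data this gives two ‘+’ documents with 6 words in total and one ‘-’ document with 4 words, which are the values the later steps use as totalplus, totalpluswords, totalneg, and totalnegwords.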

Positive Class

Python3

# For positive class
temp = []
for i in to_find:
    count = 0
    # A word missing from the vocabulary keeps count 0 instead of raising ValueError
    if i in keywords:
        x = keywords.index(i)
        for j in probability_table:
            if j[-1] == "+":
                count = count+j[x]
    temp.append(count)
# Laplace smoothing: (count + 1) / (vocabulary size + words in '+' class)
for i in range(len(temp)):
    temp[i] = format((temp[i]+1)/(vocabulary+totalpluswords), ".4f")
print()
temp = [float(f) for f in temp]
print("Probabilities of Each word to be in '+' class are: ")
for word, p in zip(to_find, temp):
    print(f"P({word}/+) = {p}")
print()
# Class prior multiplied by the probability of every word
pplus = float(format((totalplus)/(totalplus+totalneg), ".8f"))
for i in temp:
    pplus = pplus*i
pplus = format(pplus, ".8f")
print("probability of Given text to be in '+' class is :", pplus)
print()

                    

Output:

Probabilities of Each word to be in '+' class are: 
P(i/+) = 0.1538
P(hated/+) = 0.0769
P(the/+) = 0.1538
P(acting./+) = 0.1538
probability of Given text to be in '+' class is : 0.00018651

This code computes the probability of each word of the input text given the positive class (‘+’). For every word in “to_find”, it sums the word's counts over the positive documents in the count table and applies Laplace smoothing: the count plus one, divided by the vocabulary size plus the total number of words in the positive class. The smoothed value is rounded to four decimal places and printed. The overall score for the positive class is the class prior (the share of positive documents) multiplied by all the word probabilities. Laplace smoothing keeps every factor non-zero, so a word that never occurs in the positive documents does not force the whole product to zero.
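As a quick check of the smoothing arithmetic, the sketch below recomputes two of the values for the sample data, where the vocabulary has 7 tokens and the ‘+’ documents contain 6 words in total. smoothed_likelihood is a hypothetical helper name, not part of the article's code.

```python
# Hypothetical helper (not from the article): add-one (Laplace) smoothing.
def smoothed_likelihood(count_in_class, words_in_class, vocabulary_size):
    """Return (count + 1) / (vocabulary size + total words in the class)."""
    return (count_in_class + 1) / (vocabulary_size + words_in_class)


# Sample data: 7 vocabulary tokens, 6 words in the '+' documents
p_i = smoothed_likelihood(1, 6, 7)      # "i" occurs once in '+' documents
p_hated = smoothed_likelihood(0, 6, 7)  # "hated" never occurs in '+' documents
print(round(p_i, 4), round(p_hated, 4))  # 0.1538 0.0769
```

The word “hated” has a count of zero in the positive class, yet its smoothed probability is 1/13 rather than 0.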

Negative Class

Python3

# For Negative class
temp = []
for i in to_find:
    count = 0
    # A word missing from the vocabulary keeps count 0 instead of raising ValueError
    if i in keywords:
        x = keywords.index(i)
        for j in probability_table:
            if j[-1] == "-":
                count = count+j[x]
    temp.append(count)
# Laplace smoothing: (count + 1) / (vocabulary size + words in '-' class)
for i in range(len(temp)):
    temp[i] = format((temp[i]+1)/(vocabulary+totalnegwords), ".4f")
print()
temp = [float(f) for f in temp]
print("Probabilities of Each word to be in '-' class are: ")
for word, p in zip(to_find, temp):
    print(f"P({word}/-) = {p}")
print()
# Class prior multiplied by the probability of every word
pneg = float(format((totalneg)/(totalplus+totalneg), ".8f"))
for i in temp:
    pneg = pneg*i
pneg = format(pneg, ".8f")
print("probability of Given text to be in '-' class is :", pneg)
print('\n')

                    

Output:

Probabilities of Each word to be in '-' class are: 
P(i/-) = 0.1818
P(hated/-) = 0.1818
P(the/-) = 0.1818
P(acting./-) = 0.0909
probability of Given text to be in '-' class is : 0.00018206

This block repeats the positive-class computation for the negative class (‘-’). For every word in “to_find”, it sums the word's counts over the negative documents, applies Laplace smoothing with the total number of words in the negative class in the denominator, and prints the result. The overall score for the negative class is the negative class prior multiplied by all the word probabilities. As before, smoothing keeps the probability non-zero for words that do not occur in the negative documents.
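Both class scores are products of many small factors, and such products shrink quickly as the text gets longer. A common alternative, which the article's code does not use, is to add logarithms instead of multiplying probabilities. The sketch below assumes the unrounded smoothed values for the sample data; log_score is a hypothetical helper name.

```python
import math


# Hypothetical helper (not from the article): class score in log space.
def log_score(prior, word_probs):
    """Return log(prior) + sum of log(word probability)."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


# Unrounded values for the sample text "i hated the acting."
plus = log_score(2 / 3, [2 / 13, 1 / 13, 2 / 13, 2 / 13])
neg = log_score(1 / 3, [2 / 11, 2 / 11, 2 / 11, 1 / 11])
predicted = "+" if plus > neg else "-"
print(predicted)  # +
```

Because the logarithm is monotonic, comparing log scores selects the same class as comparing the products.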

Prediction

Python3

# pplus and pneg are formatted strings, so compare them as numbers
if float(pplus) > float(pneg):
    print(
        f"Using Naive Bayes Classification, We can clearly say that the given text belongs to '+' class with probability {pplus}")
else:
    print(
        f"Using Naive Bayes Classification, We can clearly say that the given text belongs to '-' class with probability {pneg}")
print('\n')

                    

Output:


Probabilities of Each word to be in '+' class are:
P(i/+) = 0.1538
P(hated/+) = 0.0769
P(the/+) = 0.1538
P(acting./+) = 0.1538
probability of Given text to be in '+' class is : 0.00018651
Probabilities of Each word to be in '-' class are:
P(i/-) = 0.1818
P(hated/-) = 0.1818
P(the/-) = 0.1818
P(acting./-) = 0.0909
probability of Given text to be in '-' class is : 0.00018206
Using Naive Bayes Classification, We can clearly say that the given text belongs to '+' class with probability 0.00018651


This code makes the final decision by comparing the two class scores. If the score for the positive class (pplus) is higher than the score for the negative class (pneg), it prints a message predicting the positive class together with that score; otherwise it prints the same message for the negative class. Note that the printed value is the prior multiplied by the word probabilities, not a normalized probability, so the two values do not sum to 1.
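If a value between 0 and 1 is wanted for each class, the two scores can be divided by their sum. This normalization is not part of the article's code; the sketch below assumes the two scores printed in the sample run, and normalize is a hypothetical helper name.

```python
# Hypothetical helper (not from the article): turn class scores into shares.
def normalize(scores):
    """Divide every score by the total so that the results sum to 1."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Scores taken from the sample run
scores = {"+": 0.00018651, "-": 0.00018206}
posteriors = normalize(scores)
for label, value in posteriors.items():
    print(label, round(value, 3))
```

For the sample run the two classes come out close to 0.506 and 0.494, which shows that the decision for ‘+’ is a narrow one.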

