Bernoulli Naive Bayes

Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm designed for binary features, that is, features that take one of two values such as ‘Yes’ or ‘No’, ‘1’ or ‘0’, ‘True’ or ‘False’. Like every Naive Bayes model, it assumes that the features are conditionally independent of one another given the class. It is commonly used for spam detection, text classification and sentiment analysis, where each feature records whether a certain word is present in a document or not. The decision rule of Bernoulli NB is based on

$$P(x_i \mid y) = P(i \mid y)\, x_i + \bigl(1 - P(i \mid y)\bigr)(1 - x_i)$$

Here, P(x_i | y) is the conditional probability of x_i occurring provided y has occurred.

i is the event (for example, a word appearing in a document), and P(i | y) is the probability of that event in class y

x_i holds a binary value, either 0 or 1
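
To make the decision rule concrete, here is a minimal sketch in plain NumPy. It assumes the usual Bernoulli likelihood, in which feature i contributes P(i | y) when x_i = 1 and 1 - P(i | y) when x_i = 0, and it picks the class with the highest prior times likelihood. The word probabilities and class priors below are made up for illustration and do not come from the email dataset.

```python
import numpy as np

def bernoulli_log_likelihood(x, p):
    """Sum over features of log[p_i * x_i + (1 - p_i) * (1 - x_i)]."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(np.log(p * x + (1.0 - p) * (1.0 - x))))

# Hypothetical probabilities that each of three words is present, per class.
p_spam = [0.8, 0.1, 0.5]
p_ham = [0.2, 0.6, 0.5]
# Hypothetical class priors.
log_prior = {"spam": np.log(0.3), "ham": np.log(0.7)}

x = [1, 0, 1]  # words 0 and 2 are present, word 1 is absent

# Log of (prior * likelihood) for each class.
scores = {
    "spam": log_prior["spam"] + bernoulli_log_likelihood(x, p_spam),
    "ham": log_prior["ham"] + bernoulli_log_likelihood(x, p_ham),
}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Note that the absent word (x_i = 0) still contributes to the score through the 1 - P(i | y) term. This is what separates the Bernoulli model from a model that only looks at the words that occur.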

Implementing Bernoulli Naive Bayes

To perform classification with Bernoulli Naive Bayes we use an email dataset.

The email dataset comprises four columns named Unnamed: 0, label, label_num and text. The label column is either ham or spam, and label_num encodes it as a number: 0 for ham and 1 for spam. The text column holds the body of the mail. The dataset has 5171 rows. The dataset can be downloaded from here.

Python3

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

In the above code we have imported the necessary libraries: pandas, numpy and sklearn (scikit-learn). The BernoulliNB class is part of the sklearn.naive_bayes module.

Python3

# read the dataset
df = pd.read_csv("/content/spam_ham_dataset.csv")
print(df.shape)    # (rows, columns)
print(df.columns)  # column names
# drop the leftover index column
df = df.drop(['Unnamed: 0'], axis=1)

In the above code we read the data, print its shape and column names, and drop the unnecessary Unnamed: 0 column.
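
A quick check that fits at this stage is the class balance, since it affects how the later metrics should be read. The sketch below uses a tiny made-up DataFrame with the same label and label_num columns as the email dataset. The rows are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the email dataset (illustrative rows only).
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham"],
    "label_num": [0, 0, 1, 0],
})

counts = toy["label"].value_counts()                # rows per class
shares = toy["label"].value_counts(normalize=True)  # fraction per class
print(counts)
print(shares)
```

On the real data, the same two calls on df["label"] would show how many mails are ham and how many are spam.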

Python3

x = df["text"].values
y = df["label_num"].values
# creating count vectorizer object
cv = CountVectorizer()
# transforming the text into a sparse matrix of word counts
x = cv.fit_transform(x)

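Since the classifier is trained on text, we first convert the text into a numeric matrix using CountVectorizer: each column corresponds to a word in the vocabulary, and each entry counts how often that word occurs in a mail. BernoulliNB later binarizes these counts, so only the presence or absence of a word matters to the model.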
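
The following sketch shows what the vectorizer produces on two made-up sentences, and what thresholding the counts at 0 does to them. The sentences are hypothetical. The thresholding step is written out by hand here to mirror what a binarize threshold of 0.0 implies, so it is an illustration rather than the library's internal code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["free free money", "meeting at noon"]

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()  # dense array of word counts
vocab = sorted(cv.vocabulary_)             # words in column order

# Presence/absence view: every count greater than 0 becomes 1.
binary = (counts > 0).astype(int)

print(vocab)
print(counts)
print(binary)
```

In the first document the word "free" has a count of 2, but after thresholding it is simply marked as present, exactly like "money".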

Python3

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
# binarize=0.0 maps every count greater than 0 to 1
bnb = BernoulliNB(binarize=0.0)
model = bnb.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support
           0       0.84      0.98      0.91       732
           1       0.92      0.56      0.70       303
    accuracy                           0.86      1035
   macro avg       0.88      0.77      0.80      1035
weighted avg       0.87      0.86      0.84      1035

In the above code we have divided the data into train and test sets in the ratio 80:20. Then we trained the model on the training data and generated a classification report from the test labels and the predicted labels. From the classification report it can be seen that the precision, recall and f1-score of class 0 (ham) are 0.84, 0.98 and 0.91 respectively, whereas for class 1 (spam) they are 0.92, 0.56 and 0.70 respectively. Spam is the minority class (303 of the 1035 test mails, about 29%), and its recall is much lower: the model misses roughly 44% of the spam. The overall accuracy of the model is 86%, which should be read alongside the low spam recall rather than on its own.
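
The numbers in the report can be put in context with a little arithmetic. The sketch below uses only the support values and the spam recall printed above. Because the recall is rounded to two decimals in the report, the count of missed spam mails is an approximation.

```python
# Support values and spam recall copied from the classification report.
support = {"ham": 732, "spam": 303}
spam_recall = 0.56

total = sum(support.values())
spam_share = support["spam"] / total            # fraction of test mails that are spam
baseline_accuracy = max(support.values()) / total  # always predicting the majority class
missed_spam = round((1 - spam_recall) * support["spam"])  # approximate false negatives

print(f"spam share: {spam_share:.2f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
print(f"approx. spam mails missed: {missed_spam}")
```

A classifier that labels every mail as ham would already reach about 71% accuracy on this test set, so the 86% figure is a gain over that baseline, while a recall of 0.56 corresponds to roughly 133 of the 303 spam mails being missed.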

Advantages

Bernoulli Naive Bayes has several advantages:

It is simple and efficient, and it can give good accuracy even for small datasets.
It performs well on binary data.
It is a common choice for text classification, where each feature simply records whether a word is present and the independence assumption keeps the model cheap to train.

Disadvantages

Bernoulli Naive Bayes also has disadvantages. Some of them are as follows:

Since it is a Naive Bayes model, it assumes that all the features are independent given the class. This rarely holds in practice and can cause the model to produce overconfident or inaccurate results.
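It only models binary features, so counts or continuous values must be binarized first, which discards information such as how often a word occurs.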
When there is class imbalance, Bernoulli Naive Bayes may handle the minority class poorly, leading to a drop in recall for that class, as seen for spam in the example above.
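
The effect of the independence assumption can be seen in a small sketch. Here one binary feature is copied three times, so the copies are perfectly correlated and carry no extra information, yet the model treats each copy as fresh evidence. The probabilities 0.8 and 0.4 and the equal priors are made up for illustration.

```python
import numpy as np

def posterior_spam(x, p_spam, p_ham, prior_spam=0.5):
    """Posterior P(spam | x) under the Bernoulli Naive Bayes model."""
    x = np.asarray(x, dtype=float)
    p_spam = np.asarray(p_spam, dtype=float)
    p_ham = np.asarray(p_ham, dtype=float)
    # Product of per-feature Bernoulli likelihoods for each class.
    like_spam = np.prod(p_spam * x + (1 - p_spam) * (1 - x))
    like_ham = np.prod(p_ham * x + (1 - p_ham) * (1 - x))
    joint_spam = prior_spam * like_spam
    joint_ham = (1 - prior_spam) * like_ham
    return float(joint_spam / (joint_spam + joint_ham))

# One feature that is present: hypothetical P(i | spam) = 0.8, P(i | ham) = 0.4.
single = posterior_spam([1], [0.8], [0.4])

# The same feature duplicated three times (perfectly correlated copies).
tripled = posterior_spam([1, 1, 1], [0.8] * 3, [0.4] * 3)

print(single, tripled)
```

With a single copy the posterior for spam is 2/3. With three identical copies it rises to 8/9, even though nothing new was observed. Correlated words in real text push the predicted probabilities toward the extremes in the same way.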