In probability theory, Bayes' theorem describes conditional probability: it gives the probability of an event based on evidence that has already been observed. Naive Bayes applies this idea as a supervised machine learning method that predicts a class from the evidence present in your dataset. In this tutorial, you will learn how to classify emails as spam or not spam using the Naive Bayes classifier.
Before the coding demonstration, let's take a brief look at Naive Bayes.
What is the Naive Bayes Classifier Model?
Naive Bayes is based on Bayes' theorem. It is called naive because it assumes that all the predictors in the dataset are independent of each other given the class. The Naive Bayes classifier is mostly used for binary and multiclass classification. The formula for the conditional probability is P(A|B) = P(B|A) * P(A) / P(B), where A is the class and B is the observed evidence.
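To make the conditional probability concrete, here is a small worked sketch of Bayes' theorem. The probabilities and the word "free" are made-up numbers chosen only for illustration; they do not come from the dataset used later in this tutorial.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.4               # P(spam): prior probability that an email is spam
p_word_given_spam = 0.5    # P("free" | spam)
p_word_given_ham = 0.1     # P("free" | not spam)

# Total probability of seeing the word in any email: P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))  # 0.769
```

Seeing the word raises the probability of spam from the prior 0.4 to about 0.77. A Naive Bayes classifier combines many such pieces of evidence by treating them as independent.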
There are three types of Naive Bayes model: Multinomial, Bernoulli, and Gaussian.
Multinomial: Apply it when the features are discrete frequency counts. For example, to classify an email as spam or not, you can use the word counts in the body of the mail.
Bernoulli: Apply it when the dataset has binary features and you make predictions from them. For example, predicting whether a buyer will buy a house or not from yes/no attributes.
Gaussian: If the dataset features are continuous and normally distributed, then Gaussian is a good choice for making predictions.
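As a rough sketch of how each variant pairs with a feature type, the snippet below fits all three models on synthetic data. The data, the variable names, and the class rates are assumptions made up for this illustration, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)
is_pos = labels[:, None] == 1  # column vector, broadcasts across features

# Count features: each class favors a different "word" (suits MultinomialNB).
rates = np.where(is_pos, [1.0, 1.0, 5.0], [5.0, 1.0, 1.0])
counts = rng.poisson(lam=rates)

# Binary features: whether a count exceeds 2 (suits BernoulliNB).
binary = (counts > 2).astype(int)

# Continuous, normally distributed features (suits GaussianNB).
continuous = rng.normal(loc=np.where(is_pos, 3.0, 0.0), scale=1.0, size=(200, 3))

train_scores = {}
for name, model, features in [
    ("multinomial", MultinomialNB(), counts),
    ("bernoulli", BernoulliNB(), binary),
    ("gaussian", GaussianNB(), continuous),
]:
    # Accuracy on the training data, only to show that each model fits its feature type.
    train_scores[name] = model.fit(features, labels).score(features, labels)

print(train_scores)
```

Note that MultinomialNB looks at the relative proportions of the counts, which is why the two classes here favor different features rather than simply having larger or smaller totals.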
Popular use cases of Naive Bayes classifiers include the following:
- Spam Detection
- Classification of the customer
- Loan Classification
- Health Risk Prediction
Assumptions of Naive Bayes Classifiers
Before building the prediction model, always check the following assumptions:
1. All the predictor features (variables) should be independent of each other given the class.
2. The model is based on conditional probability, so the historical data it learns from must still hold true for the events you want to predict.
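One quick, approximate way to look at the first assumption is to inspect the correlation between features. The sketch below uses synthetic columns and a threshold of 0.8, both chosen arbitrarily for illustration. Keep in mind that correlation only detects linear dependence, and a stricter check would look at correlations within each class separately.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # generated independently of a
c = a + 0.1 * rng.normal(size=500)    # almost a copy of a, so strongly dependent
features = np.column_stack([a, b, c])

# 3x3 correlation matrix between the feature columns
corr = np.corrcoef(features, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds the (arbitrary) threshold
n_features = features.shape[1]
dependent_pairs = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > 0.8
]
print(dependent_pairs)
```

Pairs flagged this way are candidates for dropping or merging before fitting a Naive Bayes model.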
Step 1: Import the necessary packages and libraries
import urllib.request  # in Python 3, urlopen lives in urllib.request
import numpy as np
import pandas as pd
import sklearn
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Scikit-learn (sklearn) is a machine learning package. You will import the Gaussian, Bernoulli, and Multinomial models from sklearn.naive_bayes.
Import the train_test_split function from sklearn.model_selection, and import accuracy_score from sklearn.metrics to compute the accuracy score.
Step 2: Load the Dataset
In the coding demonstration, I am using Naive Bayes for spam classification. Here I am loading the Spambase dataset directly from the UCI Machine Learning Repository using Python's urllib package.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
# In Python 3, urlopen lives in urllib.request (requires `import urllib.request`)
raw_data = urllib.request.urlopen(url)
# Parse the comma-separated values into a NumPy array
dataset = np.loadtxt(raw_data, delimiter=",")
dataset
If you look at the dataset, there are 57 predictor attributes plus a class label. The first 48 attributes each hold the percentage of words in the email that match a given word. We will take these 48 attributes as predictors, and the last attribute, which has the binary values 0 (not spam) and 1 (spam), as the target.
x = dataset[:,:48]
y = dataset[:,-1]
Step 3: Split the Dataset into Train and Test Sets
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size = 0.33, random_state = 17)
Using train_test_split from sklearn.model_selection, you will split the dataset into train and test sets with a test size of 0.33. Please note that to reproduce the exact output, you should use the same value of random_state, that is 17.
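To see what test_size and random_state do, here is a minimal sketch on a synthetic stand-in for the data (100 random rows with 48 features, made up for this illustration). It shows the sizes of the two parts and that the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 48 features, binary labels.
rng = np.random.default_rng(0)
x_demo = rng.random((100, 48))
y_demo = rng.integers(0, 2, size=100)

# Two splits with identical arguments, including random_state
split_a = train_test_split(x_demo, y_demo, test_size=0.33, random_state=17)
split_b = train_test_split(x_demo, y_demo, test_size=0.33, random_state=17)

x_train_a, x_test_a, y_train_a, y_test_a = split_a
print(x_train_a.shape, x_test_a.shape)

# True when every part of the two splits matches exactly
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With a different random_state the rows land in different parts, so the accuracy scores you get later can differ slightly.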
Step 4: Model the Naive Bayes Prediction on the dataset.
In this step, we will build all three Naive Bayes models, and after comparing them you will select the best one.
# Bernoulli Naive Bayes
BernNB = BernoulliNB(binarize=1.0)  # feature values above 1.0 become 1, the rest 0
BernNB.fit(x_train, y_train)
print(BernNB)
y_expect = y_test
y_predict = BernNB.predict(x_test)
print(accuracy_score(y_expect, y_predict))
# Multinomial Naive Bayes
MultiNB = MultinomialNB()
MultiNB.fit(x_train, y_train)
print(MultiNB)
y_expect = y_test
y_predict = MultiNB.predict(x_test)
print(accuracy_score(y_expect, y_predict))
# Gaussian Naive Bayes
GaussNB = GaussianNB()
GaussNB.fit(x_train, y_train)
print(GaussNB)
y_expect = y_test
y_predict = GaussNB.predict(x_test)
print(accuracy_score(y_expect, y_predict))
Of the three, the accuracy score of the Multinomial model is higher than the others, so we select this model. You can improve the score by modifying some of the argument values. For example, in the case of the Bernoulli model, if you use binarize = 0.25, the score will be 0.8966, which is higher than the others. Thus you will choose the model with the highest score.
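The idea of tuning an argument can be sketched as a simple loop over candidate binarize values. The data below is synthetic and the list of thresholds is an arbitrary choice for illustration, so the scores will not match the ones reported for the spam dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic data: rows of class 1 have larger feature values on average.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=600)
scale = np.where(y_demo[:, None] == 1, 1.0, 0.2)
x_demo = rng.exponential(scale=scale, size=(600, 10))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=17
)

# Fit one Bernoulli model per threshold and record its test accuracy.
scores = {}
for threshold in [0.0, 0.1, 0.25, 0.5, 1.0]:
    model = BernoulliNB(binarize=threshold)
    model.fit(x_tr, y_tr)
    scores[threshold] = accuracy_score(y_te, model.predict(x_te))

best_threshold = max(scores, key=scores.get)
print(scores)
print(best_threshold)
```

Picking the threshold on the test set tends to give an optimistic score. In practice, a separate validation set or cross-validation is a safer way to choose it.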
Performance Metrics for Classification
There are several performance metrics for classification models, such as the confusion matrix, the AUC-ROC curve, the F1 score, precision and recall, and accuracy. In the above demonstration, we have used the accuracy metric. Which one is best depends entirely on the problem statement.
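As a small illustration of how these metrics are computed with sklearn.metrics, the sketch below uses two hand-written label lists. The labels are made up for the example and are not predictions from the models above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = spam, 0 = not spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = accuracy_score(y_true, y_pred)      # correct predictions / all predictions
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                  # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)
```

For spam filtering, precision matters when a wrongly blocked legitimate email is costly, while recall matters when letting spam through is the bigger problem.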
Naive Bayes is a machine learning model based on conditional probability. You use it as a binary or multiclass classification model. Choosing among its types (Bernoulli, Multinomial, and Gaussian) depends on their accuracy scores: the higher the score, the more accurate the predictions. You can also tweak some of the arguments to get a higher score.
If you have any suggestions regarding this tutorial, please message us on the Data Science Learner page.