Text Classification Using Naive Bayes in Python: 7 Steps

Classification is a supervised machine learning task. Machine learning models accept only numeric input. From those inputs, a classification model is built against the target variable, and when you later pass new inputs to the model, it predicts their class. But how do you classify text? This tutorial shows how to do text classification using Naive Bayes in Python.

The coding part comes later. First, you should know how text classification is done.

How is Classification Done on Text?

In a collection of documents, each word becomes a feature variable, and each document is tagged with a particular class. These tags are used as the target variable. Classification algorithms require both the input and the target variable to be numeric, so you will create a TF-IDF matrix from the text for the classification.
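The idea of "each word becomes a feature" can be sketched on a toy corpus. The three sentences and their labels below are made up for illustration and are not taken from the tutorial's data files; the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents and tags, for illustration only
docs = [
    "the match ended in a draw",
    "the new phone has a faster chip",
    "the team won the match",
]
labels = ["sports", "tech", "sports"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; order the words by column
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(matrix.shape)     # (number of documents, number of distinct words)
print(feature_names)    # one column per word
```

Each row of the matrix is one document and each column is one word, which is the numeric input a classifier needs.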

Step-by-Step Guide for Classification of Text

Step 1: Import the necessary libraries

import os
import nltk
import sklearn

First, import the libraries used in this example: NLTK for tokenizing the text, removing stopwords, and lemmatizing; sklearn for building the TF-IDF matrix, encoding the labels, and Naive Bayes modeling; and os for file paths.

Step 2: Read the necessary files

Description Text File

#read the descriptions file, one description per line
with open(os.path.join(os.getcwd(), "post-descriptions.txt"), "rt") as file:
    p_descriptions = file.read().splitlines()

Classification Text File

#read the classification file, one class label per line
with open(os.path.join(os.getcwd(), "post-classifcations.txt"), "rt") as file:
    p_classification = file.read().splitlines()
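Since the model pairs each description with the class on the same line number, it can help to confirm that the two lists line up before going further. The helper below is a hypothetical addition (check_alignment is not part of the original tutorial), shown on made-up lists rather than the real files.

```python
def check_alignment(descriptions, classifications):
    """Return the number of samples, or raise if the two lists differ in length."""
    if len(descriptions) != len(classifications):
        raise ValueError(
            f"{len(descriptions)} descriptions but {len(classifications)} classifications"
        )
    return len(descriptions)

# Made-up stand-ins for p_descriptions and p_classification
sample_descriptions = ["first post text", "second post text"]
sample_classification = ["Sports", "Tech"]

n_samples = check_alignment(sample_descriptions, sample_classification)
print(n_samples)
```

With the real data you would call it as check_alignment(p_descriptions, p_classification).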

You can download both files from these links: Description Text File, Classification Text File.

Step 3: Remove stopwords and lemmatize the words

For this step, I have created a custom function cutom_tokenizer() that returns the lemmatized words after removing the stopwords. Before removing stopwords and lemmatizing, you first have to download and import the stopwords list and WordNet.

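#tokenizer models needed by nltk.word_tokenize (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")
nltk.download("stopwords")
from nltk.corpus import stopwords
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer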

#lemmatized words with stopwords removed
def cutom_tokenizer(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    #build the stopword set once instead of reloading the list for every token
    english_stopwords = set(stopwords.words("english"))
    remove_stopwords = [token for token in tokens if token not in english_stopwords]
    lemmatized_words = [lemmatizer.lemmatize(word) for word in remove_stopwords]
    return lemmatized_words

The function first tokenizes the entire text using nltk.word_tokenize(), then removes the stopwords (English only) and lemmatizes the remaining words.
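The stopword-filtering stage can be illustrated without any NLTK downloads. This is only a simplified sketch: the tiny hand-written set SAMPLE_STOPWORDS and plain whitespace splitting are hypothetical stand-ins for NLTK's English stopword list and word_tokenize, and lemmatization is left out entirely.

```python
# Hypothetical stand-in for stopwords.words("english"); the real list is much longer
SAMPLE_STOPWORDS = {"the", "is", "a", "on", "and"}

def remove_sample_stopwords(text):
    # Lowercase and split on whitespace (a crude substitute for nltk.word_tokenize)
    tokens = text.lower().split()
    # Keep only the tokens that carry content
    return [token for token in tokens if token not in SAMPLE_STOPWORDS]

filtered = remove_sample_stopwords("The cat is on the mat")
print(filtered)  # ['cat', 'mat']
```

Words such as "the" and "is" appear in almost every document, so dropping them leaves the features that actually help separate the classes.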

Step 4: Create the TF-IDF matrix for the input text

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=cutom_tokenizer)
tfidf = vectorizer.fit_transform(p_descriptions)

If you want to know more about the TF-IDF matrix, read the Advanced Text Processing Tutorial.
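To make the numbers in the matrix less opaque, here is a small sketch that recomputes TF-IDF by hand and compares it with TfidfVectorizer. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the three documents are made up, and the default tokenizer is used instead of cutom_tokenizer to keep the example free of NLTK downloads.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents, for illustration only
docs = ["the cat sat", "the dog sat", "the dog barked"]

# Term frequency: raw count of each word in each document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit length (L2 norm)
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

A word that appears in every document, such as "the", gets the lowest IDF weight, while a word confined to one document, such as "barked", gets the highest.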

Step 5: Label the Classification Text

Before building the model, you need to generate a numeric value for each class in the classification text. You can do this with the sklearn LabelEncoder.

#Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(p_classification)
print(le.classes_)

#convert the classes into numeric value
class_in_int = le.transform(p_classification)
print(class_in_int)
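As a quick illustration of what the encoder does, the sketch below runs it on a made-up list of class names (not the contents of the classification file) and maps the integers back to names with inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up class labels, for illustration only
sample_classes = ["Sports", "Tech", "Sports", "Politics"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_classes)

# Classes are sorted alphabetically, and each gets its index as its code
print(list(encoder.classes_))   # ['Politics', 'Sports', 'Tech']
print(list(encoded))            # [1, 2, 1, 0]

# inverse_transform recovers the original names from the integer codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same inverse_transform call is what turns a model's numeric prediction back into a readable class name.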

Step 6: Build the Model

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
#split into training and test dataset
#x_train,x_test,y_train,y_test = train_test_split(tfidf,class_in_int,test_size = 0.2,random_state=0)
#Model Building
classifier = MultinomialNB()
#classifier.fit(x_train,y_train)
classifier.fit(tfidf,class_in_int)

Here the text data provided is not large, which is why I am building the model on the entire original text data. If you have a large text dataset, you can split it into train and test datasets (the commented-out lines above) and build the same model on the training portion.
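For a larger dataset, the split-then-train variant might look like the sketch below. The ten sentences and labels are made up for illustration. Wrapping the vectorizer and the classifier in a pipeline is a design choice that differs from the steps above: it fits the TF-IDF vocabulary on the training portion only, so nothing from the test portion leaks into training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus: five sports sentences and five tech sentences
texts = [
    "the striker scored a goal",
    "the team won the match",
    "the coach praised the goalkeeper",
    "the match ended in a draw",
    "the team lost the final match",
    "the phone has a faster chip",
    "the laptop battery lasts longer",
    "the app update fixed a bug",
    "the new phone has a better camera",
    "the laptop runs the new app",
]
labels = ["sports"] * 5 + ["tech"] * 5

# Hold out 20% of the documents, keeping the class balance in both parts
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# The vectorizer is fitted on x_train only when the pipeline is fitted
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

held_out_accuracy = accuracy_score(y_test, model.predict(x_test))
print(held_out_accuracy)
```

Accuracy measured on the held-out documents is a fairer estimate of how the model behaves on text it has never seen.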

Step 7: Predict and score the model

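from sklearn import metrics
#predictions are made on the same data the model was trained on, so the score is optimistic
pred = classifier.predict(tfidf)
print(metrics.confusion_matrix(class_in_int,pred),"\n")
print(metrics.accuracy_score(class_in_int,pred))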
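Once a model is trained, classifying a new piece of text follows the same path: vectorize, predict, then decode the label. The sketch below is self-contained and uses made-up training sentences, so the names train_texts, train_labels, and new_texts are illustrative rather than part of the tutorial's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Made-up training data, for illustration only
train_texts = [
    "the striker scored a goal",
    "the team won the match",
    "the coach praised the goalkeeper",
    "the phone has a faster chip",
    "the laptop battery lasts longer",
    "the app update fixed a bug",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(train_texts)

encoder = LabelEncoder()
targets = encoder.fit_transform(train_labels)

model = MultinomialNB()
model.fit(features, targets)

# New text must go through transform (not fit_transform) so that it is
# mapped onto the vocabulary learned from the training data
new_texts = ["the striker scored a late goal"]
new_features = vectorizer.transform(new_texts)

# Convert the numeric prediction back into a class name
predicted = encoder.inverse_transform(model.predict(new_features))
print(predicted)
```

In the tutorial's own variables, the equivalent calls would be vectorizer.transform(...), classifier.predict(...), and le.inverse_transform(...).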

Finally, you have built the classification model for the text dataset. Tutorials on other websites can be very lengthy and confusing. Here at Data Science Learner, we have given simple steps that you can follow to build a text classification model.

Hope you have clearly understood it. If you have any suggestions or want to improve this tutorial, you can contact or message us at our official Data Science Learner Twitter handle.