How to Predict using Logistic Regression in Python? 7 Steps

Logistic Regression is one of the most popular Machine Learning algorithms for predicting categorical outcomes. It is a supervised Machine Learning algorithm used for classification. You can think of this model as giving Yes or No answers. For example, given a customer dataset with features such as age group and city, you can build a Logistic Regression model to predict a binary outcome for each customer: whether they will buy or not. In this tutorial, you will learn how to predict using Logistic Regression in Python.

Difference Between the Linear and Logistic Regression

Linear Regression: In Linear Regression you predict continuous numerical values from the training dataset. That is, the output can be any number within a range.

Logistic Regression: Here you predict categorical or ordinal values. It means the predictions are discrete values, such as 0 or 1.
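The difference can be made concrete with a small sketch. Logistic Regression passes a linear score through the sigmoid function to get a probability, and then applies a threshold to get a discrete class. The scores below are hypothetical values chosen for illustration, and the 0.5 threshold is the usual default rather than a fixed rule.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores (weights * features + bias) for four samples
scores = np.array([-3.0, -0.5, 0.5, 3.0])

# A linear regression would report the scores themselves (continuous values).
# A logistic regression converts them to probabilities and then to classes.
probabilities = sigmoid(scores)
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A score of exactly zero corresponds to a probability of 0.5, which is why the sign of the linear score decides the predicted class at this threshold.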

Popular Use Cases of the Logistic Regression Model

There are many popular use cases for Logistic Regression. Some of them are the following:

Purchase Behavior: To check whether a customer will buy or not.

Disaster Prediction: Predict the possibility of hazardous events like floods, cyclones, etc.

Disease Prediction: Whether a person is likely to have cancer or not.

Handwriting recognition

Assumptions on the DataSet

The following assumptions should be checked before doing Logistic Regression. Keep them in mind as conditions before modeling.

There should be no missing values in the dataset.
The target variable must be binary (only two values) or ordinal (a categorical variable with ordered values).
The predictor variables should be independent of each other, that is, there should be little or no correlation between them.
The dataset should contain at least 50 observations for reliable results.
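The assumptions above can be checked in code before modeling. The following is a minimal sketch, not a standard library routine: the function name check_assumptions, its parameters, and the synthetic DataFrame are all hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def check_assumptions(df, feature_cols, target_col, min_rows=50):
    """Return a dict summarising the four checks (hypothetical helper)."""
    # Absolute Spearman correlation between every pair of features
    corr = df[feature_cols].corr(method="spearman").abs().to_numpy()
    off_diagonal = corr[~np.eye(len(feature_cols), dtype=bool)]
    return {
        "no_missing_values": not df.isnull().values.any(),
        # Only the binary case is checked here; an ordinal target needs its own check
        "binary_target": df[target_col].nunique() == 2,
        "max_abs_feature_correlation": float(off_diagonal.max()) if off_diagonal.size else 0.0,
        "enough_rows": len(df) >= min_rows,
    }


# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x1": rng.normal(size=60),
    "x2": rng.normal(size=60),
    "target": np.arange(60) % 2,
})

report = check_assumptions(demo, ["x1", "x2"], "target")
print(report)
```

The correlation entry is reported as a number rather than a pass or fail, because how much correlation is acceptable is a judgment call for your own dataset.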

Step by Step for Predicting using Logistic Regression in Python

Step 1: Import the necessary libraries

Before doing the logistic regression, load the necessary Python libraries such as numpy, pandas, scipy, matplotlib, and sklearn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pylab import rcParams
import seaborn as sb
import scipy
from scipy.stats import spearmanr

import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm

Here you are importing for the following purposes:

rcParams for matplotlib visualization parameters.
spearmanr for computing the Spearman rank correlation coefficient. It is used to check whether two variables are correlated.
scale for standardizing the dataset (zero mean and unit variance).
train_test_split for splitting the data into training and test sets.
sklearn.metrics for generating accuracy reports.

Step 2: Define the Parameters for Matplotlib

%matplotlib inline
rcParams["figure.figsize"] =10,5
sb.set_style("whitegrid")

The %matplotlib inline magic command tells Jupyter Notebook to show all figures inline. The other two lines set the default figure size and the seaborn plot style.

Step 3: Load the Dataset

In this step, you will load the dataset and define the target and the input variables for your model. I am using the mtcars dataset. You can download it from the GitHub URL.

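address = "data/mtcars.csv"
cars = pd.read_csv(address)
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

# Input variables: drat (column 5) and carb (column 11)
data = cars.iloc[:, [5, 11]].values
data_names = ["drat", "carb"]
# Target variable: am (column 9), taken as a 1-D array, which is the shape fit() expects
y = cars.iloc[:, 9].values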

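Step 4: Check for the independence of the variables

drat = cars["drat"]
carb = cars["carb"]
# Find the Spearman rank coefficient.
spearmanr_coff, p_value = spearmanr(drat, carb)
spearmanr_coff
# a coefficient close to zero means little or no correlation

The Spearman rank coefficient is negative but small in magnitude, so drat and carb have little or no correlation. What matters is how close the coefficient is to zero, not its sign: a value near -1 would indicate a strong negative correlation. We can therefore treat these two variables as independent of each other for this model.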
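To see how the coefficient should be read, here is a small sketch on synthetic data. It compares a pair of unrelated variables with a pair that has a perfectly decreasing relationship. The arrays and variable names are made up for illustration and are not part of the mtcars dataset.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic data for illustration only
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b_unrelated = rng.normal(size=200)  # generated independently of a
b_inverse = -a ** 3                 # strictly decreasing function of a

coef_unrelated, p_unrelated = spearmanr(a, b_unrelated)
coef_inverse, p_inverse = spearmanr(a, b_inverse)

# A coefficient near 0 suggests little or no monotonic relationship
print("unrelated:", coef_unrelated, p_unrelated)
# A coefficient near -1 is a strong negative correlation, not independence
print("inverse:", coef_inverse, p_inverse)
```

The p-value returned alongside the coefficient is worth checking too, since a small coefficient on a small sample can easily arise by chance.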

Step 5: Check for the missing values

cars.isnull().sum()

You can see there are no missing values in the dataset, which is good. If you find any missing values in a dataset, remove or replace them. Read the following tutorial on dealing with missing values.

Steps to Deal with the missing values.
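As a quick illustration of the two options, removing or replacing, here is a minimal sketch using pandas. The small DataFrame is hypothetical and only borrows the drat and carb column names. Filling with the median is one common choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "drat": [3.9, np.nan, 3.08, 4.22],
    "carb": [4, 1, np.nan, 2],
})

# Count the missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: remove every row that contains a missing value
dropped = df.dropna()

# Option 2: replace each missing value with the median of its column
filled = df.fillna(df.median())

print(dropped)
print(filled)
```

Dropping rows is simpler but loses data, which matters on a small dataset, while filling keeps every row at the cost of introducing estimated values.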

Step 6: Check whether the target is binary or ordinal

sb.countplot(x="am",data=cars,palette="hls")

From the figure, you can see that the am variable is binary: it has only the values 0 and 1.
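If you prefer a numeric check over reading the plot, the same thing can be verified with value_counts. This is a small sketch on a hypothetical series that stands in for the am column.

```python
import pandas as pd

# Hypothetical stand-in for the am column
am = pd.Series([1, 1, 0, 0, 1, 0], name="am")

# Number of rows for each distinct value
counts = am.value_counts().sort_index()
print(counts)

# True only if every value is either 0 or 1
is_binary = set(am.unique()) <= {0, 1}
print(is_binary)
```

The counts also show whether the two classes are roughly balanced, which is useful to know before interpreting accuracy.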

Step 7: Deploy and check the accuracy of the model

x = scale(data)
LogReg = LogisticRegression()
#fit the model
LogReg.fit(x,y)
#print the score
print(LogReg.score(x,y))

After scaling the data, you fit the LogReg model on x and y. LogReg.score(x, y) returns the mean accuracy of the model on the given data (not the R-squared value, which applies to regression models). In this case, the score is 0.8125, which is good, although it is measured on the same data the model was trained on. You can use sklearn.metrics to generate a classification report with the precision and recall of the model.

y_predict = LogReg.predict(x)
from sklearn.metrics import classification_report
report = classification_report(y,y_predict)
print(report)

The precision and recall of the above model are 0.81, which is adequate for prediction. Remember to look for high recall and high precision when choosing the best model.
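The scores above are computed on the same data used for fitting. A common alternative is to hold out part of the data with train_test_split and evaluate on rows the model has not seen. The sketch below shows that approach on synthetic data from make_classification, not on mtcars. The dataset, the 25% test size, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data for illustration only
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Hold out 25% of the rows for testing, keeping the class ratio similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Predict classes and class-1 probabilities for the unseen rows
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

# For a classifier, score() and accuracy_score() report the same number
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
print(classification_report(y_test, y_pred))
```

Fitting the scaler on the training rows only keeps information from the test rows out of the model, so the test accuracy is a fairer estimate of how the model behaves on new data.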

Conclusion:

Logistic Regression is a popular way to predict values when the target is binary or ordinal. The main requirement is that the data must be clean, with no missing values. You can use it in any field where you want to predict the decision of a user. Just follow the above steps and you will master it.

We hope this tutorial on how to predict using Logistic Regression in Python helps you deploy the model on your own dataset. If you have any queries, please contact us or message us on our official Data Science Learner page.