Logistic Regression is one of the most popular machine learning algorithms for predicting categorical outcomes. It is a supervised learning algorithm used for classification. You can think of this model as giving Yes or No answers. For example, given a customer dataset with features such as age group and city, you can build a Logistic Regression model to predict a binary outcome for each customer: whether they will buy or not. In this how-to tutorial, you will learn how to predict using Logistic Regression in Python.
Difference Between the Linear and Logistic Regression
Linear Regression: In Linear Regression you predict continuous numerical values from the training dataset. That is, a prediction can be any number within a range.
Logistic Regression: Here you predict categorical or ordinal values that are encoded as numbers. It means the predictions are discrete values.
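The difference can be made concrete with a small sketch. Logistic regression computes the same kind of linear score as linear regression, then passes it through the logistic (sigmoid) function to get a probability, which is finally thresholded into a discrete class. The intercept, slope, and ages below are hypothetical values chosen only for illustration; they are not fitted to any dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept, slope = -4.0, 0.1
ages = np.array([20, 45, 70])

linear_score = intercept + slope * ages              # continuous and unbounded
probability = sigmoid(linear_score)                  # squeezed into (0, 1)
predicted_class = (probability >= 0.5).astype(int)   # discrete 0/1 label

print(linear_score)
print(probability)
print(predicted_class)
```

The 0.5 threshold is the usual default, but it is a choice you can change depending on how costly each kind of mistake is.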
Popular Use Cases of the Logistic Regression Model
There are many popular use cases for Logistic Regression. Some of them are the following:
Purchase Behavior: To check whether a customer will buy or not.
Disaster Prediction: Predict the possibility of hazardous events like floods, cyclones, etc.
Disease Prediction: Predict whether a person is likely to have cancer or not.
Assumptions on the DataSet
The following assumptions apply before doing Logistic Regression. Keep them in mind as conditions before modeling.
- There should be no missing values in the dataset.
- The target variable must be binary (only two values) or ordinal (a categorical variable with ordered values).
- The other data variables (the predictors) should not be related to one another. It means they are independent and have no correlation between them.
- The dataset should contain at least 50 observations for reliable results.
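Some of these conditions can be checked in code before modeling. The sketch below is one possible way to do it; `check_assumptions` is a hypothetical helper written for this example, and the `demo` frame is synthetic. It covers the missing-value, binary-target, and minimum-size conditions, while the independence check is handled separately in Step 4.

```python
import numpy as np
import pandas as pd

def check_assumptions(df, target, min_rows=50):
    """Return simple pass/fail checks for a dataset (hypothetical helper)."""
    return {
        "no_missing_values": not df.isnull().values.any(),
        "binary_target": df[target].nunique() == 2,
        "enough_rows": len(df) >= min_rows,
    }

# Synthetic stand-in data: 60 customers with an age and a 0/1 purchase flag
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=60),
    "bought": np.tile([0, 1], 30),
})

print(check_assumptions(demo, target="bought"))
```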
Step by Step for Predicting using Logistic Regression in Python
Step 1: Import the necessary libraries
Before doing the logistic regression, load the necessary Python libraries such as numpy, pandas, scipy, matplotlib, and sklearn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
import scipy
from scipy.stats import spearmanr
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm
Here you are importing for the following purposes:
- rcParams for matplotlib visualization parameters.
- spearmanr for finding the Spearman rank coefficient. It is used for checking whether two variables are dependent or independent.
- scale for standardizing the dataset (zero mean and unit variance for each feature).
- train_test_split for dividing the data into training and test sets.
- sklearn metrics for accuracy report generation.
Step 2: Define the Parameters for Matplotlib
%matplotlib inline
rcParams["figure.figsize"] = 10, 5
sb.set_style("whitegrid")
The %matplotlib inline magic tells Jupyter Notebook to show all figures inline. The other two lines set the default figure size and the seaborn plot style.
Step 3: Load the Dataset
In this step, you will load the dataset and define the target and the input variables for your model. I am using the mtcars dataset. You can download it from the GitHub URL.
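address = "data/mtcars.csv"
cars = pd.read_csv(address)
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
# Input features: drat (column 5) and carb (column 11)
data = cars.iloc[:, [5, 11]].values
data_names = ["drat", "carb"]
# The target column index is missing here in the source;
# put the index of your binary target column after the comma
y = cars.iloc[:,].values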
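Positional indices such as 5 and 11 are easy to get wrong, so selecting columns by name can be a safer alternative. The sketch below shows that both approaches pick the same columns. It uses a three-row in-memory CSV as a stand-in for the real file; the car names and values are for illustration only, and `cars_demo` is a name introduced for this example.

```python
import io
import pandas as pd

# Tiny stand-in for data/mtcars.csv (illustrative rows only)
csv_text = """car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Car A,21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4
Car B,22.8,4,108.0,93,3.85,2.320,18.61,1,1,4,1
Car C,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
"""
cars_demo = pd.read_csv(io.StringIO(csv_text))

by_position = cars_demo.iloc[:, [5, 11]].values    # columns 5 and 11
by_name = cars_demo[["drat", "carb"]].values       # same columns, by name

print((by_position == by_name).all())
```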
Step 4: Check for the independence of the variables
drat = cars["drat"]
carb = cars["carb"]
# Find the Spearman rank correlation coefficient
spearmanr_coff, p_value = spearmanr(drat, carb)
spearmanr_coff  # a value close to 0 means little correlation
The Spearman rank coefficient here is negative. The sign only shows the direction of the relationship; it is a magnitude close to zero that indicates little correlation. On that basis, drat and carb are treated as independent of each other.
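When reading a Spearman coefficient, it helps to see how it behaves on data where the answer is known. The sketch below uses synthetic arrays (not the mtcars columns): one pair is strongly but negatively related, the other pair is generated independently. The variable names are introduced only for this example.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b_related = -2.0 * a + rng.normal(scale=0.1, size=200)  # strong negative relationship
b_unrelated = rng.normal(size=200)                      # generated independently of a

rho_related, p_related = spearmanr(a, b_related)
rho_unrelated, p_unrelated = spearmanr(a, b_unrelated)

print("related:  ", rho_related, p_related)
print("unrelated:", rho_unrelated, p_unrelated)
```

A coefficient near -1 or +1 signals a strong monotonic relationship, and a coefficient near 0 signals a weak one. The p-value indicates how surprising the observed coefficient would be if the two variables were truly unrelated.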
Step 5: Check for the missing values
The dataset has no missing values, which is good. If you find any missing values in your dataset, remove or replace them.
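One common way to run this check with pandas is to count the missing entries per column, then either drop the affected rows or fill the gaps. The sketch below uses a small made-up frame with deliberately missing entries; filling with the column median is just one possible strategy, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberately missing entries
demo = pd.DataFrame({
    "drat": [3.90, np.nan, 3.08, 3.15],
    "carb": [4, 1, np.nan, 2],
})

missing_per_column = demo.isnull().sum()   # count of missing values per column
dropped = demo.dropna()                    # option 1: remove incomplete rows
filled = demo.fillna(demo.median())        # option 2: replace with the column median

print(missing_per_column)
print(dropped)
print(filled)
```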
Step 6: Check that the target is binary or ordinal
From the figure, you can see that the variable is binary: it has only 0 and 1 values.
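The same check can also be done without a plot by counting the distinct values of the target. The sketch below uses a short made-up series in place of the real target column; with seaborn you could additionally draw the counts as a bar chart.

```python
import pandas as pd

# Made-up 0/1 target used only for illustration
target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="target")

counts = target.value_counts().sort_index()   # how often each class occurs
is_binary = set(target.unique()) <= {0, 1} and target.nunique() == 2

print(counts)
print(is_binary)
```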
Step 7: Deploy and check the accuracy of the model
x = scale(data)
LogReg = LogisticRegression()
# Fit the model
LogReg.fit(x, y)
# Print the score
print(LogReg.score(x, y))
After scaling the data, you fit the LogReg model on x and y. LogReg.score(x, y) outputs the model score, which for a classifier is the mean accuracy (not an R-squared value), computed here on the same data used for training. In this case, the score is 0.8125, which is good. You can use sklearn metrics to generate a classification report.
y_predict = LogReg.predict(x)
from sklearn.metrics import classification_report
report = classification_report(y, y_predict)
print(report)
The precision and recall of the above model are 0.81, which is adequate for prediction. Remember to look for both high recall and high precision in the best model.
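The steps above score the model on the same data it was trained on, which tends to give an optimistic picture. A common alternative is to hold out part of the data with train_test_split and evaluate only on that part. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from make_classification rather than mtcars, and it swaps scale for StandardScaler so that the scaling learned on the training set can be reused on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature binary classification data (stand-in for a real dataset)
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Hold out 25% of the rows for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Learn the scaling on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
test_accuracy = accuracy_score(y_test, y_pred)

print(test_accuracy)
print(classification_report(y_test, y_pred))
```

With a dataset as small as mtcars, a single split leaves very few test rows, so the resulting numbers should be read with caution.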
Logistic Regression is a popular way to predict values when the target is binary or ordinal. The main requirement is that the data must be clean, with no missing values. You can use it in any field where you want to predict a user's decision. Just follow the above steps and you will master it.
We hope this tutorial on how to predict using Logistic Regression in Python helps you deploy the model on your own dataset. If you have any queries, please contact us or message us on our official Data Science Learner page.