
How to Deal with Missing Data in Python?

Data scientists work with large datasets to produce better analyses. If a dataset has missing values in its rows and columns, it can lead to wrong predictions. Dealing with missing data is therefore a major task for every data scientist who wants correct predictions, and it is one of the top data preprocessing steps. To learn more about those steps, see our post on them.

Top 4 Data Pre Processing Steps.

In this “How to” tutorial, you will learn the following things.

  1. How to find the missing values in the dataset.
  2. The methods for filling the missing values.
  3. How to find the total number of missing values in the dataset.
  4. How to filter out the missing values.

Follow the step-by-step methods below to learn more about this topic.

Step 1: Import the necessary libraries.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Step 2: Read the dataset using Pandas.

For this example, I am reading a sales dataset. The data is complete, so for demonstration purposes I am setting some values in the Sales and Price columns to missing using NumPy's np.nan. If your dataset already has missing values, move on to Step 3.

address = "C:\\Users\\skrsu\\Desktop\\Jypter\\data\\sales_data.csv"
sales_data = pd.read_csv(address)
sales_data.columns = ["orderNumber","Quantity","Price","Sales","Date"]

Let’s set the Price column to NaN at row positions 2 and 6, and the Sales column to NaN at row positions 1, 4 and 7. Note that iloc uses zero-based positions, so position 2 is the third row.
Use the following code to select those specific rows and change their values to NaN.

sales_data.iloc[[2, 6], 2] = np.nan

This puts missing values at positions 2 and 6 of the Price column (column position 2).

sales_data.iloc[[1, 4, 7], 3] = np.nan

This puts missing values at positions 1, 4 and 7 of the Sales column (column position 3).
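
If you do not have the sales CSV at hand, the setup from this step can be reproduced in memory. This is only a minimal sketch: the values below are invented for illustration and are not the original sales data. Only the column names and the iloc positions come from the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales_data.csv (all values are made up)
sales_data = pd.DataFrame({
    "orderNumber": [10100, 10101, 10102, 10103, 10104, 10105, 10106, 10107],
    "Quantity": [30, 50, 22, 49, 25, 26, 45, 46],
    "Price": [95.70, 81.35, 94.74, 83.26, 100.00, 96.66, 86.13, 90.00],
    "Sales": [2871.00, 4067.50, 2084.28, 4079.74, 2500.00, 2513.16, 3875.85, 4140.00],
    "Date": ["2/24/2003", "5/7/2003", "7/1/2003", "8/25/2003",
             "10/10/2003", "10/28/2003", "11/11/2003", "11/18/2003"],
})

# iloc is zero-based: rows 2 and 6 of column 2 ("Price") become NaN
sales_data.iloc[[2, 6], 2] = np.nan
# Rows 1, 4 and 7 of column 3 ("Sales") become NaN
sales_data.iloc[[1, 4, 7], 3] = np.nan

print(sales_data)
```

Printing the frame lets you confirm visually that NaN appears exactly where you expect before moving on.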

Step 3: Check whether the dataset has missing data.

Use the following method to find the missing values.

sales_data.isnull().sum()

It tells you the total number of missing values in each column. If you set the values as in Step 2, it should report 2 for Price and 3 for Sales.
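
The per-column count can be extended to a single total for the whole dataset and to a view of the affected rows. The sketch below assumes a small, made-up frame named df with the same missing positions as in Step 2. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sales data with the same missing positions
df = pd.DataFrame({
    "Price": [95.7, 81.35, np.nan, 83.26, 100.0, 96.66, np.nan, 90.0],
    "Sales": [2871.0, np.nan, 2084.28, 4079.74, np.nan, 2513.16, 3875.85, np.nan],
})

missing_per_column = df.isnull().sum()           # Series: one count per column
total_missing = int(df.isnull().sum().sum())     # single number for the whole frame
rows_with_missing = df[df.isnull().any(axis=1)]  # rows containing at least one NaN

print(missing_per_column)
print("Total missing:", total_missing)
print(rows_with_missing)
```

Summing twice collapses the per-column counts into one number, which answers the "total number of missing values" question from the introduction.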

Step 4: Filling the missing values.

To do this, use the Pandas DataFrame fillna() method. You can fill the values in three ways.
Suppose I want to fill the missing values with 0. Then I call fillna(0), passing 0 as the argument.

sales_data.fillna(0)
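
One detail worth checking is that fillna() returns a new DataFrame and leaves the original untouched unless you assign the result. The following sketch uses a tiny made-up frame (df and zero_filled are illustrative names) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one gap in each column
df = pd.DataFrame({
    "Price": [95.7, np.nan, 83.26],
    "Sales": [np.nan, 4067.5, 4079.74],
})

zero_filled = df.fillna(0)  # returns a new DataFrame; df itself is unchanged

print(zero_filled)
print("Still missing in df:", int(df.isnull().sum().sum()))
```

If you want to keep the filled result, assign it back, for example sales_data = sales_data.fillna(0).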

You can also fill the missing values with the mean of the corresponding column. In this case, I will fill the missing values with the means of Price and Sales using the fillna() method, but instead of passing 0 as the argument, I will pass a dictionary that maps each column name to its fill value.

# Mean of the Price and Sales columns (NaN values are skipped by default)
mean_price = sales_data["Price"].mean()
mean_sales = sales_data["Sales"].mean()

filled_sales_data = sales_data.fillna({"Price": mean_price, "Sales": mean_sales})
filled_sales_data

This fills all the missing values of the Price and Sales columns with the mean of the corresponding column.
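
When many columns need the same treatment, building the dictionary by hand gets tedious. As a possible shortcut, fillna() also accepts a Series whose index holds column names, so the result of mean() can be passed directly. The sketch below uses a small made-up frame. Its names and values are assumptions chosen so the means are easy to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Price mean is 20.0, Sales mean is 150.0 (NaN is skipped)
df = pd.DataFrame({
    "Price": [10.0, np.nan, 30.0],
    "Sales": [100.0, 200.0, np.nan],
})

column_means = df.mean(numeric_only=True)  # Series indexed by column name
mean_filled = df.fillna(column_means)      # each column gets its own mean

print(column_means)
print(mean_filled)
```

Because mean() ignores NaN by default, the fill value is computed from the observed values only.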

You can also fill each missing value with the last non-null value above it in the same column (a forward fill) using the ffill() method. Older code writes this as fillna(method="ffill"), which recent Pandas versions deprecate.

sales_data.ffill()
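
Forward filling has one edge case that is easy to miss: a missing value in the first row has nothing above it, so it stays NaN. The sketch below shows this on a made-up single-column frame, and also shows the limit parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a leading gap and a two-row gap in the middle
df = pd.DataFrame({"Price": [np.nan, 95.7, np.nan, np.nan, 83.26]})

forward = df.ffill()         # copy the last seen value downwards
limited = df.ffill(limit=1)  # fill at most one consecutive NaN per gap

print(forward)
print(limited)
```

In forward, the first row remains NaN while both middle gaps take the value 95.7. In limited, only the first of the two middle gaps is filled.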

Step 5: Filter out null data in a large dataset.

Suppose you have a large dataset in which some rows or columns consist mostly of null values. Then, instead of filling their values with fillna(), you should remove those rows or columns with the dropna() method.

If you want to delete the rows that contain null values, you can simply call dropna() without any axis argument, as in this case:

sales_data.dropna()

To delete columns instead, pass axis=1 as the argument, as in dropna(axis=1). You should delete a column only when most of its rows are null. That is why most data scientists use the default, row-wise dropna().

In some cases all the values in a row are null. In that case, use dropna(how="all"), which drops only the rows that are entirely null. In our dataset, no row has all null values, so the output is simply the original sales data, still containing the null values.
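
The three dropna() variants can be compared side by side. This is a minimal sketch on a made-up frame that, unlike the sales data, does contain one entirely empty row so that how="all" has a visible effect. The frame and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the last row is entirely null
df = pd.DataFrame({
    "orderNumber": [1, 2, 3, np.nan],
    "Price": [95.7, np.nan, 83.26, np.nan],
    "Sales": [2871.0, 4067.5, np.nan, np.nan],
})

complete_rows = df.dropna()                   # keep rows with no NaN at all
no_empty_rows = df.dropna(how="all")          # drop only rows that are all NaN
complete_cols = no_empty_rows.dropna(axis=1)  # then keep columns with no NaN

print(complete_rows)
print(no_empty_rows)
print(complete_cols)
```

Here the default dropna() keeps a single row, how="all" removes only the empty last row, and the column-wise call keeps only orderNumber, which illustrates how much data axis=1 can discard.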

Conclusion

Data cleaning is a major step before building machine learning models, and it leads to better predictions. Pandas is a popular library for manipulating and cleaning raw data and turning it into structured data.

We hope this tutorial has given you a basic understanding of “How to Deal with Missing Data in Python?”. If you want to suggest a tutorial for us to write, contact us. You can also subscribe or like us for faster learning.

Thanks 

Data Science Learner Team.