In real-world scenarios, we collect data from many different sources, and inconsistency in that data is a very common problem. In machine learning and data science, we call this noise. As the saying goes, if garbage goes into a model, garbage comes out. Missing values in a dataset are one kind of noise. There are multiple ways of handling missing values in data with Python, and in this article, we will explore all of them.
There are two types of variables in the dataset: numerical and categorical.
We will first create one random dataset with both types of columns, including missing values.
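A minimal sketch of such a toy dataset (the values here are made up for illustration; the column names match the ones used in the snippets later in the article, including the trailing space in 'Brand '):

```python
import numpy as np
import pandas as pd

# Toy car dataset: one categorical column ('Brand ', note the trailing
# space) and two numeric columns, with np.nan marking missing values
df = pd.DataFrame({
    'Brand ':   ['Honda', 'Toyota', np.nan, 'Ford', 'Honda', np.nan],
    'Ex-price': [12000.0, np.nan, 15000.0, 13500.0, np.nan, 12800.0],
    'Age':      [3.0, 5.0, np.nan, 2.0, 7.0, np.nan],
})

# count the missing values per column
print(df.isna().sum())
```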
We can handle missing values either by deleting them or by imputing them, but deletion is rarely a good option. In most cases, we should impute missing values by approximating them. Still, there are scenarios where deleting the observation makes sense.
If the entire row or observation is empty.
df=df.dropna(how='all')
If a column contains less than 5 percent of its data, we can delete that column as well. The syntax is the same; we only need to add axis=1 to the call above.
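The column-wise deletion can also be expressed with dropna's thresh parameter, which keeps only the columns that have at least a given number of non-missing values (a sketch on a made-up frame; swap in 0.05 for the 5 percent rule from the text):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'keep':   [1.0, 2.0, 3.0, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 5.0],
})

# keep only columns with at least 50% non-missing values in this toy
# frame; for the article's 5 percent rule, use 0.05 instead of 0.5
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
```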
There are multiple tricks for this. Here are the main approaches.
When the missing values are not missing completely at random, we can apply this technique. Of the three central tendencies, the median is usually the best strategy because it is not sensitive to outliers. Here is the code for this technique.
df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].median())
# df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].mode()[0])
# df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].mean())
If the missing values are missing completely at random, we can adopt this technique (handling missing values in data with Python with a random sample). We follow the steps below.
import random
import pandas as pd

# sample only from the observed (non-missing) ages
age_list = df['Age'].dropna().tolist()

def miss_value_filler(age):
    if pd.isna(age):
        return random.choice(age_list)
    return age

df['Age'] = df['Age'].apply(miss_value_filler)
This is another useful way to impute meaningful values for nulls. Here the K-Nearest Neighbors algorithm finds the records most similar to each incomplete one and fills the gaps based on their values.
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
# fit_transform returns a 2D array, so flatten it before assignment;
# KNN works best when several numeric features are passed together
df['Age'] = knn_imputer.fit_transform(df[['Age']]).ravel()
We can also fill in missing values using the previous value (forward fill) or the next value (backward fill).
df['Age'] = df['Age'].ffill()
df['Age'] = df['Age'].bfill()
In our dataset, the categorical column is Brand, and a lot of its values are missing. There are two ways to deal with such values. The first is to drop the rows that contain missing values, but that loses information, so we usually delete only rows with no values in any column. The better strategy is usually to impute these categories with the best approximation. In this section, we will explore those techniques.
This is a one-liner if we use fillna(). We pass the mode of the column to fill the missing categorical values with the most frequent category. The mode of a series is simply its most frequently occurring element.
df['Brand '] = df['Brand '].fillna(df['Brand '].mode()[0])
This method is identical to what we did for the numeric column. If the values are missing at random, this technique is a good fit. In the sample code below, only the column name changes. Try it yourself for a better understanding.
import random
import pandas as pd

# sample only from the observed (non-missing) brands
brand_list = df['Brand '].dropna().tolist()

def miss_value_filler(brand):
    if pd.isna(brand):
        return random.choice(brand_list)
    return brand

df['Brand '] = df['Brand '].apply(miss_value_filler)
Here we fill the missing categorical value with any technique (such as most frequent), but we also create an additional column to record that the row originally had a missing value. In our experience, this technique sometimes noticeably outperforms plain imputation.
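A minimal sketch of the idea on a made-up Brand column: flag the missing rows first, then impute with the most frequent category:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Brand ': ['Honda', np.nan, 'Toyota', np.nan, 'Honda']})

# record which rows were originally missing before imputing
df['Brand_missing'] = df['Brand '].isna().astype(int)
df['Brand '] = df['Brand '].fillna(df['Brand '].mode()[0])
```

The model can then learn from the Brand_missing flag even after the original gaps are filled.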
If the values are not missing at random and the rest of the data in the row is related to them, we can build a classifier where the categorical variable is the dependent variable (target) and the remaining columns serve as input features. Its predictions then fill the gaps.
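One way this could be sketched with scikit-learn (a RandomForestClassifier is just one possible model choice, and the frame below is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    'Brand ':   ['Honda', 'Toyota', np.nan, 'Honda', 'Toyota', np.nan],
    'Ex-price': [12000, 15000, 12400, 11800, 15200, 14900],
    'Age':      [3, 5, 4, 2, 6, 5],
})

features = ['Ex-price', 'Age']
known = df[df['Brand '].notna()]
missing_mask = df['Brand '].isna()

# train on the complete rows, then predict the missing categories
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(known[features], known['Brand '])
df.loc[missing_mask, 'Brand '] = clf.predict(df.loc[missing_mask, features])
```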
We have covered many tricks for handling missing data in Python, along with when each one is appropriate. To summarize the complete view in one frame, please refer to the table below.
Thanks
Data Science Learner Team