In real-world scenarios, we collect data from many different sources, and inconsistency in that data is a very common problem. In machine learning and data science, we call this noise. As the saying goes, if garbage goes into a model, garbage comes out. Missing values in a dataset are one kind of noise. There are multiple ways of handling missing values in data with Python, and in this article, we will explore all of them.
There are two types of variables in the dataset: numerical and categorical.
We will first create one random dataset with both types of columns, including missing values.
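A minimal sketch of such a toy dataset (the values here are made up for illustration; the column names match the ones used in the snippets later in the article, including the trailing space in 'Brand '):

```python
import numpy as np
import pandas as pd

# Toy car dataset: one categorical column ('Brand ', note the trailing
# space) and two numeric columns, with np.nan marking missing values
df = pd.DataFrame({
    'Brand ':   ['Honda', 'Toyota', np.nan, 'Ford', 'Honda', np.nan],
    'Ex-price': [12000.0, np.nan, 15000.0, 13500.0, np.nan, 12800.0],
    'Age':      [3.0, 5.0, np.nan, 2.0, 7.0, np.nan],
})

# count the missing values per column
print(df.isna().sum())
```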
We can handle missing values either by deleting them or by imputing them, but deletion is rarely a good option. In most cases, we should impute missing values by approximating them. Still, there are scenarios where deleting the observation makes sense.
If the entire row or observation is empty.
df=df.dropna(how='all')
If a column contains less than 5 percent of its data, we can delete that column as well. The syntax is the same; we only need to add axis=1 to the call above.
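The column-wise deletion can also be expressed with dropna's thresh parameter, which keeps only the columns that have at least a given number of non-missing values (a sketch on a made-up frame; swap in 0.05 for the 5 percent rule from the text):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'keep':   [1.0, 2.0, 3.0, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 5.0],
})

# keep only columns with at least 50% non-missing values in this toy
# frame; for the article's 5 percent rule, use 0.05 instead of 0.5
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
```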
There are multiple tricks for this. Here are the main approaches.
When the missing values are not missing completely at random, we can apply this technique. Of the three central tendencies, the median is usually the best strategy because it is not sensitive to outliers. Here is the code for this technique.
df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].median())
# df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].mode()[0])
# df['Ex-price'] = df['Ex-price'].fillna(df['Ex-price'].mean())
If the missing values are missing completely at random, we can adopt this technique (handling missing values in data with Python with a random sample). We follow the steps below.
import random
import pandas as pd

# sample only from the observed (non-missing) ages
age_list = df['Age'].dropna().tolist()

def miss_value_filler(age):
    if pd.isna(age):
        return random.choice(age_list)
    return age

df['Age'] = df['Age'].apply(miss_value_filler)
This is another useful way to impute meaningful values for nulls. Here the K-Nearest Neighbors algorithm finds the records most similar to each incomplete one and fills the gaps based on their values.
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
# fit_transform returns a 2D array, so flatten it before assignment;
# KNN works best when several numeric features are passed together
df['Age'] = knn_imputer.fit_transform(df[['Age']]).ravel()
We can also fill in missing values using the previous value (forward fill) or the next value (backward fill).
df['Age'] = df['Age'].ffill()
df['Age'] = df['Age'].bfill()
In our dataset, the categorical column is Brand, and a lot of its values are missing. There are two ways to deal with such values. The first is to drop the rows that contain missing values, but that loses information, so we usually delete only rows with no values in any column. The better strategy is usually to impute these categories with the best approximation. In this section, we will explore those techniques.
This is a one-liner if we use fillna(). We pass the mode of the column to fill the missing categorical values with the most frequent category. The mode of a series is simply its most frequently occurring element.
df['Brand '] = df['Brand '].fillna(df['Brand '].mode()[0])
This method is identical to what we did for the numeric column. If the values are missing at random, this technique is a good fit. In the sample code below, only the column name changes. Try it yourself for a better understanding.
import random
import pandas as pd

# sample only from the observed (non-missing) brands
brand_list = df['Brand '].dropna().tolist()

def miss_value_filler(brand):
    if pd.isna(brand):
        return random.choice(brand_list)
    return brand

df['Brand '] = df['Brand '].apply(miss_value_filler)
Here we fill the missing categorical value with any technique (such as most frequent), but we also create an additional column to record that the row originally had a missing value. In our experience, this technique sometimes noticeably outperforms plain imputation.
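A minimal sketch of the idea on a made-up Brand column: flag the missing rows first, then impute with the most frequent category:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Brand ': ['Honda', np.nan, 'Toyota', np.nan, 'Honda']})

# record which rows were originally missing before imputing
df['Brand_missing'] = df['Brand '].isna().astype(int)
df['Brand '] = df['Brand '].fillna(df['Brand '].mode()[0])
```

The model can then learn from the Brand_missing flag even after the original gaps are filled.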
If the values are not missing at random and the rest of the data in the row is related to them, we can build a classifier where the categorical variable is the dependent variable (target) and the remaining columns serve as input features. Its predictions then fill the gaps.
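One way this could be sketched with scikit-learn (a RandomForestClassifier is just one possible model choice, and the frame below is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    'Brand ':   ['Honda', 'Toyota', np.nan, 'Honda', 'Toyota', np.nan],
    'Ex-price': [12000, 15000, 12400, 11800, 15200, 14900],
    'Age':      [3, 5, 4, 2, 6, 5],
})

features = ['Ex-price', 'Age']
known = df[df['Brand '].notna()]
missing_mask = df['Brand '].isna()

# train on the complete rows, then predict the missing categories
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(known[features], known['Brand '])
df.loc[missing_mask, 'Brand '] = clf.predict(df.loc[missing_mask, features])
```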
We have covered many tricks for handling missing data in Python, along with when each one is appropriate. To summarize the complete view in one frame, please refer to the table below.
Thanks
Data Science Learner Team