What Is Feature Engineering: Use It to Optimize Datasets

Feature engineering (in Python) involves cleaning, transforming, validating, and balancing data. In a machine learning workflow, these steps come right after the raw data is collected and before modeling. Feature engineering has one of the most significant impacts on a model's accuracy and performance. It is a multi-step process, and the steps below help reduce noise in the dataset.

Data Preprocessing task

Let us look at the issue each step helps to solve. For readers who prefer to learn feature engineering from video, we have created a video tutorial that explores noise issues in data.

Data Preprocessing and Feature Engineering (Python)

 

We address five main issues while preprocessing a dataset. Let us go through them one by one.

 

1. Missing Value Handling –

If the dataset contains missing values, they can partially or completely mislead the model for the affected observations. We use approximation (imputation) techniques, such as filling a gap with the column's mean or median, to deal with them.

missing value handling in data preprocessing
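
To make the idea of approximating missing values concrete, here is a minimal sketch in plain Python. It assumes missing entries are represented as `None`; the function name `impute_missing` and the sample `ages` column are hypothetical and only serve as an illustration. In practice a library imputer would typically be used instead.

```python
from statistics import mean, median

def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Keep observed values as they are; fill only the gaps.
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [20, None, 30, 40, None]
imputed_ages = impute_missing(ages, strategy="mean")
print(imputed_ages)
```

The median variant is usually the safer choice when the column contains extreme values, because the mean is pulled toward them.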

 

2. Categorical Variable Handling (Encoding) –

Most machine learning models cannot handle categorical data directly. To address this, we convert categorical data into numerical data. There are multiple techniques for this, such as label encoding and one-hot encoding.

 

Categorical Variable Handling
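
Two common ways to turn categories into numbers are label encoding and one-hot encoding. The sketch below shows both in plain Python. The helper names `label_encode` and `one_hot_encode` and the `colors` column are hypothetical, and the categories are sorted alphabetically purely to make the output deterministic.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    return categories, [code_of[v] for v in values]

def one_hot_encode(values):
    """Turn each value into a 0/1 row with one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories]
            for v in values]
    return categories, rows

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

categories, codes = label_encode(colors)
_, one_hot_rows = one_hot_encode(colors)

print(categories)    # column order used by both encodings
print(codes)
print(one_hot_rows)
```

Label encoding implies an ordering between categories, which suits ordinal data. One-hot encoding avoids that implied ordering at the cost of one extra column per category.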

 

3. Feature Scaling & Transformation –

In modeling, we use optimization algorithms such as gradient descent. With these algorithms, the relative magnitudes of the features strongly affect how quickly training converges, and therefore the processing time. For this reason, we need to bring all the features onto the same scale.

Feature SCALING & Transformation
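
Two widely used scaling schemes are min-max scaling, which maps a feature to the range [0, 1], and standardization, which gives it zero mean and unit standard deviation. The following is a minimal sketch in plain Python. The function names and the `incomes` column are hypothetical, and the population standard deviation (`pstdev`) is an assumption made for this illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature with a large magnitude compared to, say, an age column.
incomes = [20_000, 40_000, 60_000, 100_000]

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)
print(scaled)
print(z_scores)
```

Whichever scheme is chosen, the scaling parameters (minimum and maximum, or mean and standard deviation) should be computed on the training data and then reused unchanged on validation and test data.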

 

4. Handling outliers in the dataset –

An outlier is a data point that does not make sense in context: its value is very high or very low relative to the feature's distribution. The example below illustrates this.

Handling outliers in dataset
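
One common way to flag values that are very high or very low relative to a feature's distribution is the interquartile range (IQR) rule. The sketch below is an illustration in plain Python rather than the only possible method. The function name `split_outliers`, the multiplier `k=1.5`, and the sample `readings` are assumptions made for this example.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule.

    A value is an outlier if it lies below Q1 - k*IQR or above Q3 + k*IQR.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers

# Hypothetical feature where one reading is far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = split_outliers(readings)
print(kept)
print(outliers)
```

Flagged points are not necessarily errors, so it is worth inspecting them before deciding whether to drop, cap, or keep them.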

 

5. Handling imbalanced dataset –

In classification tasks, the output classes are often not equally distributed. A model built on such data tends to be biased toward the majority class and to perform poorly on the minority class. To avoid this, we need to balance the dataset using one of the techniques below.

  • Over Sampling
  • Down Sampling
Handling imbalanced dataset
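
The two balancing techniques listed above can be sketched with random resampling. This is a minimal illustration in plain Python, assuming a simple list of rows with one label per row. The function names, the fixed `seed`, and the toy data are hypothetical.

```python
import random
from collections import Counter

def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical dataset: 8 rows of class 0 and 2 rows of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Oversampling keeps all the original data but repeats minority rows, while down-sampling discards majority rows. In either case, resampling is normally applied to the training split only, so that evaluation still reflects the real class distribution.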

 

Thanks

Data Science Learner Team

 

 