What is Feature Engineering? Use It to Optimize Datasets

Feature Engineering (Python) involves cleaning, transforming, validating, and balancing the data. We perform these steps right after collecting the raw data and before building a machine learning model. Feature Engineering has one of the most significant impacts on the accuracy and performance of a model. It is a multi-step process, and the steps below help us remove noise from the dataset.

Let us understand what issue each step helps to solve. For readers who prefer to learn Feature Engineering from video, we have created a video tutorial that explores noise issues in data.

Data Preprocessing and Feature Engineering (Python) –

There are five main issues we address while preprocessing a dataset. Let us understand them one by one.

1. Missing Value Handling –

If the dataset contains missing values, they can mislead the model, partially or completely, for the affected observations. Here we use approximation (imputation) techniques, such as filling a gap with the mean or median of the column, to deal with missing values.
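As a minimal sketch of one such approximation technique, the snippet below fills missing entries with the median of the observed values. The function name `impute_median`, the sample `ages` column, and the use of `None` to mark a missing value are assumptions made for illustration. Only the Python standard library is used so that the mechanics stay visible.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries (marked as None).
ages = [25, None, 31, 40, None, 28]
imputed_ages = impute_median(ages)
print(imputed_ages)
```

The median is used here because it is less sensitive to extreme values than the mean. Either choice is only an approximation of the true value.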

2. Categorical Variable Handling (Encoding) –

Most machine learning models cannot handle categorical data directly. To address this, we have to convert categorical data into numerical data. There are multiple techniques for this, such as label encoding and one-hot encoding.
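One common conversion technique is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is a hand-rolled illustration under assumed names (`one_hot_encode`, the `colors` column). It is not the only way to encode categories, and other schemes may suit a given model better.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, in the position of that observation's category, so no artificial ordering between categories is introduced.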

3. Feature Scaling & Transformation –

In modeling, we use optimization algorithms such as Gradient Descent. For these algorithms, the relative magnitudes of the features matter for processing time, because features on very different scales slow down convergence. Therefore, we need to bring all the features to the same scale.
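One way to put features on the same scale is standardization: subtract the mean and divide by the standard deviation, so every feature ends up with mean 0 and standard deviation 1. The following is a small sketch of that idea. The function name `standardize` and the sample `incomes` and `ages` columns are illustrative assumptions.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two hypothetical features with very different magnitudes.
incomes = [30000, 45000, 60000, 75000, 90000]
ages = [22, 35, 41, 58, 64]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)
print([round(v, 3) for v in scaled_incomes])
print([round(v, 3) for v in scaled_ages])
```

After scaling, both features vary over a comparable range, so neither one dominates the gradient updates purely because of its units.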

4. Handling outliers in the dataset –

An outlier is a data point that does not make sense next to the rest of the data. Its value is very high or very low relative to the distribution of the feature.
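As a small illustration, the sketch below flags values that fall far outside the bulk of a feature using the interquartile range (IQR) rule. The 1.5 multiplier is a common convention rather than something prescribed by this article, and the function name `iqr_outliers` and the sample `salaries` data are assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical salaries (in thousands); one value is far above the rest.
salaries = [42, 45, 47, 50, 52, 55, 58, 400]
outliers = iqr_outliers(salaries)
print(outliers)
```

Whether a flagged point should be removed, capped, or kept depends on whether it is a data error or a genuine rare observation.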

5. Handling imbalanced dataset –

In classification tasks, the output classes are often not equally distributed. If we build the model on such data, it tends to be biased toward the majority class. To avoid this, we need to balance the dataset using one of the techniques below.

Over Sampling
Down Sampling
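Both techniques can be sketched with random resampling. Over sampling duplicates randomly chosen minority rows until every class matches the largest one, while down sampling keeps only a random subset of each larger class. The function name `resample`, its `strategy` and `seed` parameters, and the toy data are assumptions made for this illustration.

```python
import random
from collections import Counter

def resample(rows, labels, strategy="over", seed=0):
    """Balance classes by random over sampling or random down sampling."""
    if strategy not in ("over", "down"):
        raise ValueError("strategy must be 'over' or 'down'")
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target:
            # Keep `target` distinct rows (drops rows when down sampling).
            chosen = rng.sample(group, target)
        else:
            # Keep every row and add random duplicates up to `target`.
            chosen = group + rng.choices(group, k=target - len(group))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical imbalanced dataset: 8 rows of class 0, 2 rows of class 1.
rows = list(range(10))
labels = [0] * 8 + [1] * 2

over_rows, over_labels = resample(rows, labels, strategy="over")
down_rows, down_labels = resample(rows, labels, strategy="down")
print(Counter(over_labels))
print(Counter(down_labels))
```

Resampling is usually applied to the training split only, so that the evaluation data keeps the original class distribution.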


Thanks

Data Science Learner Team