Data preprocessing is the first step in any machine learning or predictive analytics project. Before you start reading, please note that this article is written for beginner and aspiring Python developers and data scientists. This article, Data Preprocessing in Python: Importance, covers the journey from beginner to advanced learner. Let's start.
Data Preprocessing in Python: Importance
There are several reasons why we perform data preprocessing:
- Many machine learning algorithms train slowly when the features (data) are not scaled. Suppose you have two features, one in the range 0 to 2 and the other in the range 0 to 1,000,000. If you fit a regression on them, the coefficients need many iterations of adjustment before the predictions become accurate. This increases the training time. If you scale the features uniformly, training performs better.
- We should always remove unexpected values from the data set. For example, many implementations of the random forest algorithm do not support null values. Replacing such values with meaningful ones is also part of data preprocessing.
- The data set should be in a state where we can easily swap the underlying machine learning algorithm. Preprocessing converts the data into a compatible format.
- We have to convert categorical data into numeric data. Underneath, most machine learning algorithms work on numeric data, not on text.
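The points above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a recommended pipeline: the toy rows, the column names (`ratio`, `income`, `city`), and the helper functions are all hypothetical. Mean imputation and min-max scaling are just one possible choice for each step.

```python
from statistics import mean

# Hypothetical toy data set (assumption, not from the article): two numeric
# features on very different scales, one with a null, plus one categorical feature.
rows = [
    {"ratio": 0.5, "income": 250000.0, "city": "Delhi"},
    {"ratio": 2.0, "income": None, "city": "Pune"},
    {"ratio": 0.0, "income": 1000000.0, "city": "Delhi"},
    {"ratio": 1.5, "income": 0.0, "city": "Mumbai"},
]


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Turn category labels into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


ratio = min_max_scale([r["ratio"] for r in rows])
income = min_max_scale(impute_mean([r["income"] for r in rows]))
categories, city = one_hot([r["city"] for r in rows])

# Every row is now purely numeric, and both numeric features lie in [0, 1].
features = [[a, b, *c] for a, b, c in zip(ratio, income, city)]
```

In practice, libraries such as scikit-learn provide ready-made equivalents (for example `MinMaxScaler`, `SimpleImputer`, and `OneHotEncoder`), so hand-written helpers like these are mainly useful for understanding what each step does.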
I think these are reasons enough for you to read about and get hands-on with data preprocessing in Python. There are several others, but these are the major ones.
Is domain data a bottleneck?
Domain data can create problems in preprocessing. Please do not blindly follow a predefined or conventional preprocessing lifecycle with domain data. With domain data, you have to understand which technique helps you the most. Usually a null value is either dropped or replaced, but in a domain application the null itself may carry useful information. Treat this as an awareness check on your data.
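As a rough illustration of a null that is worth keeping, the sketch below records whether a value was missing before filling it in, so the model can still see that signal. The records, the field name `last_payment_days`, the helper `add_missing_flag`, and the fill value are all hypothetical. Whether a flag like this helps depends entirely on your domain.

```python
# Hypothetical domain data (assumption): a null in "last_payment_days"
# means the customer has never made a payment, which is informative in itself.
records = [
    {"customer": "a", "last_payment_days": 12},
    {"customer": "b", "last_payment_days": None},
    {"customer": "c", "last_payment_days": 40},
]


def add_missing_flag(records, field, fill_value):
    """Return copies of the records with a 0/1 '<field>_missing' flag,
    filling nulls in `field` with `fill_value` afterwards."""
    flagged = []
    for rec in records:
        new_rec = dict(rec)  # copy so the original records stay untouched
        is_missing = rec.get(field) is None
        new_rec[field + "_missing"] = 1 if is_missing else 0
        if is_missing:
            new_rec[field] = fill_value
        flagged.append(new_rec)
    return flagged


prepared = add_missing_flag(records, "last_payment_days", fill_value=0)
```

Here the fill value only makes the column numeric; the separate flag column is what preserves the fact that the value was originally absent.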
The scope of this article was to introduce you to the importance of preprocessing. I have seen teams invest a lot of time in finding the best machine learning algorithms. They try various combinations of machine learning models and still never get good accuracy. Data science is more about the data and less about the algorithm. We usually ignore this. If it were all about the algorithms, we would not be scientists. We earn the scientist tag because we are there to identify patterns in data. We shape the data, and we ensure that the algorithms get proper data. That is what preprocessing is all about. I always encourage spending at least 25% of your time on understanding, cleaning, and shaping data.
I hope this article motivates you to invest in preprocessing. If you want to share your own preprocessing story, you can describe how preprocessing changed your evaluation metrics. We love to hear back from our readers. In fact, we love to be the audience of our audience.
Data Science Learner Team