Data Preprocessing in Python: Importance

Data preprocessing is one of the first steps in any machine learning or predictive analytics project. Before you start reading, please note that this article is written for beginner and aspiring Python developers and data scientists. This article, Data Preprocessing in Python: Importance, covers the journey from beginner to advanced learner. Now let's start.

Data Preprocessing in Python: Importance

There are several reasons why we perform data preprocessing:

  1. Many machine learning algorithms train more slowly if the features (data) are not scaled. Suppose you have two features, one in the range 0 to 2 and the other in the range 0 to 1,000,000. If you fit a regression on them with an iterative optimizer, it takes many iterations of adjusting the regression coefficients to reach an accurate prediction. This increases the training time. If you scale the features uniformly, training converges faster.
  2. We should always handle unexpected values in the data set. For example, many random forest implementations do not accept null values. Replacing such values with meaningful ones is also part of data preprocessing.
  3. The data set should be in a condition where we can easily change the underlying machine learning algorithm applied to it. Preprocessing converts the data into a compatible format.
  4. We have to convert categorical data into numeric data. Most machine learning algorithms work on numeric data, not on text.
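
To make the scaling, null-handling, and categorical-encoding points above concrete, here is a minimal sketch that uses only the Python standard library. The helper names (`min_max_scale`, `impute_mean`, `one_hot_encode`) and the toy data are assumptions made for illustration, not part of any library. Min-max scaling and mean imputation are just one possible choice each; in practice you would more likely reach for a library such as scikit-learn, which offers `StandardScaler`, `SimpleImputer`, and `OneHotEncoder` for the same jobs.

```python
from statistics import mean


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot_encode(labels):
    """Turn category labels into 0/1 indicator columns, one per category."""
    categories = sorted(set(labels))
    return {c: [1 if label == c else 0 for label in labels] for c in categories}


# Hypothetical toy data set, stored as one list per feature.
small = [0.5, 1.0, 2.0, 0.0]                 # feature in the range 0 to 2
large = [250000, None, 1000000, 0]           # feature in the range 0 to 1,000,000, with a null
city = ["Delhi", "Pune", "Delhi", "Mumbai"]  # categorical feature

small_scaled = min_max_scale(small)
large_scaled = min_max_scale(impute_mean(large))  # fill the null first, then scale
city_encoded = one_hot_encode(city)

print(small_scaled)
print(large_scaled)
print(city_encoded)
```

After these steps both numeric features lie in the same 0 to 1 range, the null is gone, and the text column has become numeric indicator columns, so the same data can be passed to different algorithms without further changes.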

I think these are enough reasons for you to read about and get hands-on with data preprocessing in Python. There are several others, but these are the major ones.

Is domain data a bottleneck? –

Domain data can create problems in preprocessing. Please do not blindly follow a predefined or commonly used preprocessing lifecycle with domain data. With domain data, you have to understand which technique helps you the most. Usually a null value is either dropped or replaced, but in a domain application the fact that a value is missing may itself be useful. Treat this as an awareness check on your data.
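
One way to keep the information carried by a null, sketched under the assumption that a missing value is meaningful in your domain, is to record a "was missing" flag before filling the gap. The function name `add_missing_indicator`, the loan-application scenario, and the fill value of 0 are all hypothetical choices for illustration.

```python
def add_missing_indicator(values, fill):
    """Return (filled_values, was_missing) so the 'missing' signal is kept."""
    was_missing = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing


# Hypothetical example: income left blank on a loan application.
income = [42000, None, 58000, None]
income_filled, income_missing = add_missing_indicator(income, fill=0)

print(income_filled)
print(income_missing)
```

The model then sees both the filled column and the flag column, so it can learn whether "left blank" matters for the prediction instead of losing that signal.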

Conclusion –

The scope of this article was to introduce you to the importance of preprocessing. I have seen teams invest a lot of time in finding the best machine learning algorithm. They try various combinations of machine learning models. Still, they never get good accuracy. Data science is more about the data and less about the algorithm. We usually ignore this. If it were all about the algorithms, we would not be called scientists. The scientist tag exists because we are there to identify patterns in data. We shape the data. We also ensure that the algorithms get proper data. And that is what preprocessing is all about. I always encourage spending at least 25% of your time on understanding, cleaning, and shaping data.

I hope this article motivates you to take preprocessing seriously. If you want to share your own preprocessing story, you may describe how preprocessing changed your evaluation metrics. We love to hear back from our readers. In fact, we love to be the audience of our audience.


Data Science Learner Team 

Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages, where he and his team share knowledge and help others learn more about data science.