How to Design the Best Machine Learning Datasets – A Complete Guide


If you are a beginner in machine learning, you must read this article, and I have a solid reason for saying so. Read this first paragraph and you will understand why. I have seen one common situation that affects machine learning practitioners, especially beginners, who usually get stuck here. First they choose a problem area. They work hard to choose the best machine learning algorithm and framework. They collect a data set and train the model. When they run cross validation, they find very low accuracy. They change the algorithm and retrain the model, and they keep doing this for a long time. Accuracy and precision stay below the expected level, which frustrates them, and finally they stop trying. Most of the time, however, poor accuracy in a machine learning model is not a problem with the algorithm. It happens because of immature machine learning datasets.

Machine Learning Datasets vs Machine Learning Algorithms –

Machine learning is not all about programming. The datasets are usually more important. So I thought I should write an article to help machine learning practitioners design the best datasets for their problem statements. Today you can get most things on the Internet with a single click, and that includes the full code for a machine learning model. That is why machine learning datasets matter more than the code. Every problem is different, so you need a different data set every time. Even when you reuse an old data set, you have to transform it.

Tips for Designing the Machine Learning Datasets-

There are many things to keep in mind while designing machine learning datasets:

1. Quantity of Machine Learning Datasets-

When you teach a child to recognize a banana, 4-5 examples are typically enough before he or she starts responding correctly. Machines are different from humans. Here you often need thousands of examples even to train a small model. The exact quantity of data depends entirely on the application. In general, you should never train your model on too little data.
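To make the effect of quantity concrete, here is a minimal sketch that trains a very simple classifier on growing amounts of data and measures its test accuracy. Everything in it is an assumption made for illustration: the synthetic one-dimensional data, the midpoint-threshold classifier, and the helper names `make_points`, `fit_threshold`, `accuracy` and `learning_curve`.

```python
import random
import statistics

def make_points(n, rng):
    """Return n labeled 1-D points, alternating class 0 (around 0.0) and class 1 (around 2.0)."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit_threshold(train):
    """Learn a decision threshold as the midpoint between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, data):
    """Fraction of points where 'x > threshold' agrees with the class 1 label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

def learning_curve(sizes, seed=0):
    """Train on each training-set size and score on one shared test set."""
    rng = random.Random(seed)
    test = make_points(2000, rng)
    curve = []
    for n in sizes:
        train = make_points(n, rng)
        curve.append((n, accuracy(fit_threshold(train), test)))
    return curve

for n, acc in learning_curve([4, 16, 64, 256, 1024]):
    print(f"{n:5d} training examples -> test accuracy {acc:.3f}")
```

A single run is noisy, so the numbers may not rise at every step. The general tendency is that the learned threshold becomes more reliable as the training set grows.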

2. Data set cleaning –

One of the most important steps is cleaning the data set. You should write some code or use a tool to remove noisy data. If the data is unstructured text, you may use Natural Language Processing techniques and libraries to clean it. In ordinary cases, regular expressions (regex) are also quite useful.
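As a rough illustration of regex-based cleaning, the sketch below strips HTML tags, URLs and punctuation from raw text records, then drops empty and duplicate entries. The function names `clean_text` and `clean_records` and the specific cleaning rules are assumptions for this example. Real projects will need rules that fit their own data.

```python
import re

def clean_text(raw):
    """Normalize one raw text record."""
    text = re.sub(r"<[^>]+>", " ", raw)          # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.strip().lower()

def clean_records(records):
    """Clean every record, then skip empty results and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in records:
        text = clean_text(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

print(clean_records(["Hello, World!", "hello world", "   ", "<b>Noise</b> ###"]))
```

Lowercasing and removing all punctuation is a deliberately aggressive choice. For tasks where case or symbols carry meaning, those steps should be relaxed.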

3. Feature Selection –

Feature selection is very important. Let's understand it with an everyday example. Suppose you have to purchase a car. There are many factors that can affect your decision. For example –

1. Body type of the car, such as sedan or SUV

2. Color of the car

3. Engine type, diesel or petrol

4. Price of the car

5. Horn sound of the car

6. Mileage of the car

Of these 6 factors, not all are useful, and they are not equally important. In feature selection you should always keep the factors with the most effect on the outcome. Feature selection is one form of dimensionality reduction. You can only do it well if you have good domain knowledge of your project.
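Domain knowledge comes first, but a quick numeric check can support it. The sketch below ranks numeric features by the absolute value of their Pearson correlation with a target. The car figures, the column names and the `satisfaction` target are all made up for illustration, and correlation ranking is only one simple filter method among many. Categorical features such as color or engine type would need to be encoded as numbers first.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (feature, |correlation|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data for six cars.
cars = {
    "price": [5, 7, 9, 12, 15, 20],
    "mileage": [22, 20, 18, 15, 13, 10],
    "horn_loudness": [100, 104, 101, 103, 100, 104],
}
satisfaction = [9, 8, 7, 5, 4, 2]  # hypothetical buyer rating

for name, score in rank_features(cars, satisfaction):
    print(f"{name:15s} {score:.2f}")
```

In this toy data, price and mileage score far higher than horn loudness, which matches the intuition from the list above.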

3. The problem of overfitting and underfitting in Machine Learning Datasets-

Overfitting is the situation where a model performs well on the training data set but becomes less accurate on new, real-world data. This often happens when there is a lot of noisy data in your data set. Let's understand it with an easy example. Suppose you have data about countries' population and GDP along with other financial ratios, and you are predicting GDP on the basis of population. If you use deep learning here, the system may learn that countries whose names start with a particular letter have a higher GDP. This wrong pattern will give you wrong predictions.
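Overfitting on noisy data can be seen in a small experiment. The sketch below is an illustration under assumed conditions: synthetic points whose true label is `x > 0.5`, with 20% of labels flipped at random. It compares a model that memorizes every training point (1-nearest-neighbour) with a simple fixed threshold rule. All function names are hypothetical.

```python
import random

def make_noisy_data(n, rng, noise=0.2):
    """Points in [0, 1); the true label is x > 0.5, but a fraction of labels is flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # simulate a noisy, wrong label
        data.append((x, label))
    return data

def predict_memorizer(train, x):
    """1-nearest-neighbour: copy the label of the closest training point, noise included."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def predict_simple(x):
    """A deliberately simple rule with one fixed threshold."""
    return int(x > 0.5)

def accuracy(predict, data):
    """Fraction of points whose prediction matches the recorded label."""
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_noisy_data(200, rng)
test = make_noisy_data(200, rng)

def memorizer(x):
    return predict_memorizer(train, x)

print("memorizer   train:", accuracy(memorizer, train), "test:", accuracy(memorizer, test))
print("simple rule train:", accuracy(predict_simple, train), "test:", accuracy(predict_simple, test))
```

The memorizer scores perfectly on the data it was trained on because it has learned the noise as well as the pattern. Its score on fresh data drops, which is the gap that signals overfitting.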

I know what you are thinking right now. You have understood the problem of overfitting in machine learning, and now you are thinking about the solution. Right?

Regularization as the solution to overfitting –

If your model has too many parameters, try to reduce them, or limit how large they can grow, as much as possible. This steers the training in the right direction. In short, this process is called regularization. If you want to read more on overfitting, you may refer to the article by Analytics Vidhya, "How to avoid over-fitting using regularization".
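One common form of regularization is an L2 (ridge) penalty, which adds the squared size of the weights to the training loss so that large weights are discouraged. The sketch below works this out for the simplest possible case, a single weight `w` in the model `y = w * x`. The data points and the function name `fit_ridge_1d` are assumptions for illustration.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Return the w that minimizes sum((y - w*x)^2) + penalty * w^2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + penalty
    return numerator / denominator

# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for penalty in [0.0, 10.0, 100.0]:
    print(f"penalty={penalty:6.1f} -> w={fit_ridge_1d(xs, ys, penalty):.3f}")
```

With no penalty the weight fits the data as closely as possible. As the penalty grows the weight shrinks toward zero, which trades a slightly worse fit on the training data for a simpler model.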

Now let's discuss underfitting. It happens when the model is too simple, for example a straight line, while the real problem is influenced by more parameters. In simple words, it is the opposite of overfitting.

4. Data Sampling –

When you prepare machine learning datasets, they should cover all cases. Suppose you have data about countries with 10,000 records, and 80% of the records are about Asian countries. That is a biased dataset. The data should be distributed evenly across the cases you care about. Try to avoid such bias when designing machine learning datasets.
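A quick way to spot this kind of bias is to count how the records are spread across groups. The sketch below does that, and then shows one possible remedy, downsampling every group to the size of the smallest one. The record layout, the `continent` key and the group sizes other than the 80% share are assumptions for illustration. Downsampling throws data away, so collecting more records for the underrepresented groups is often the better fix when it is feasible.

```python
import random
from collections import Counter, defaultdict

def group_shares(records, key):
    """Return the fraction of records that falls in each group."""
    counts = Counter(record[key] for record in records)
    total = len(records)
    return {group: count / total for group, count in counts.items()}

def balanced_sample(records, key, rng):
    """Randomly downsample every group to the size of the smallest group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    size = min(len(members) for members in groups.values())
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical dataset of 10,000 records, 80% of them from Asia.
records = (
    [{"continent": "Asia"} for _ in range(8000)]
    + [{"continent": "Europe"} for _ in range(1200)]
    + [{"continent": "Africa"} for _ in range(800)]
)

print(group_shares(records, "continent"))
balanced = balanced_sample(records, "continent", random.Random(0))
print(group_shares(balanced, "continent"))
```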

I think you have now understood the essentials of data set design in machine learning. Along with these principles, knowledge of high-performance architecture for database design will be a bonus for you. To learn data science and machine learning essentials you need practice data sets. You can get plenty of free data sets from-

1. UC Irvine Machine Learning Repository

2. Kaggle Datasets

3. AWS Datasets for Machine learning.

End Notes-

I will share a very popular piece of research with you. Microsoft researchers Eric and Michele showed how important the quantity of training data is for machine learning.

[Figure: Machine learning datasets]

This figure shows how test accuracy increases as the quantity of data increases. You can find the full research paper here. So the next time you design machine learning datasets, keep these design principles in mind.

Here is a last-minute tip on data sets: use 80% of the data for training and the remaining 20% for cross validation of your machine learning model. Remember that this is only a recommendation; you can use a 70-30 ratio as well.
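The split itself takes only a few lines. Below is a minimal sketch using the Python standard library. The function name `split_dataset` and its parameters are hypothetical, and libraries such as scikit-learn offer ready-made helpers for the same job.

```python
import random

def split_dataset(records, held_out_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (training, held-out) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_held_out = int(round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]

# 80-20 split of 100 hypothetical records.
train_part, held_out_part = split_dataset(range(100))
print(len(train_part), len(held_out_part))

# 70-30 split of the same records.
train_part_70, held_out_part_30 = split_dataset(range(100), held_out_fraction=0.3)
print(len(train_part_70), len(held_out_part_30))
```

Shuffling before splitting matters when the records are stored in some order, for example sorted by date or by class. Otherwise the held-out part may not resemble the training part.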

If you find this article helpful, please let us know in the comment box. Your suggestions are also welcome. If you want to stay in touch with data science articles like this one, you can also subscribe to our newsletter.

Data Science Learner Team



Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages, where he and his team share knowledge and help others learn more about data science.