Top 25+ Datasets for Machine Learning and Statistics Projects in 2021


Are you looking to build a machine learning or AI-based intelligent app? You will need a large amount of data to train your model. Machine learning projects often fail not because of the model or the infrastructure but because of poor datasets. Beginners in data science, in particular, waste a lot of time searching for the best datasets for machine learning projects. To help them out and save their valuable time, we have put together this article, which collects links to data sources where you can download datasets and start a machine learning project.

Even if you are not a beginner, I strongly recommend reading it in full. Why? If you are associated with the analytics industry in any way, you will need these datasets. If you do not want to read it all right now, bookmark it. As a data scientist, I always bookmark evergreen articles related to the analytics industry so that I can come back to them at any time.

Secondly, the most important thing here is the variety of the dataset collection.

Dataset repositories for machine learning and statistics projects-


Here is the list of data sources. Note that every dataset has its own properties and specification, so you need to keep track of them.

1. Open Dataset For Machine Learning-

First, we will cover open repositories with some of the best public datasets for machine learning and data science.

 UCI Machine Learning Repository –


This repository contains data from various domains, ranging from car evaluation to credit approval. At the time of writing, UCI hosts 433 datasets. You will also find variety in the dataset design: some are labeled (for classification), some are meant for clustering, and so on. What I like most about this repository is the website navigation. When you open the website, you will see on the left a number of parameters by which you can filter the datasets.

 Kaggle

Sometimes I feel Kaggle is a complete platform for data science. On Kaggle you get datasets, kernels, and a community for discussion. You can also create and share your own dataset with the community. The best part of Kaggle is that you get not only traditional data but also some genuinely interesting datasets, sometimes tied to events you know from movies, such as Titanic.


In data science, it is essential to understand the dataset deeply. If you are a beginner and pick a dataset from a totally unfamiliar domain, that is going to be a big challenge. On Kaggle you can find datasets about topics you already know something about. Once you have learned the data science techniques, you can switch to any other domain.

AWS datasets-

Amazon also provides a wide range of machine learning datasets. You can use and analyze them on your local computer or on the cloud services provided by AWS. To make things easier for beginners, AWS provides how-to articles with examples for operations related to the datasets.


SOCR Height and Weight Dataset

If you want to build machine learning projects around the Body Mass Index (BMI), this dataset can be useful. It has 25,000 records of people's heights and weights.
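As a rough illustration of the BMI idea, here is a minimal Python sketch. It assumes heights are given in inches and weights in pounds, so check the units on the dataset page before reusing it. The function name and the sample values are made up for illustration and are not rows from the dataset.

```python
def bmi_from_imperial(height_in: float, weight_lb: float) -> float:
    """Body Mass Index (kg / m^2) from height in inches and weight in pounds."""
    if height_in <= 0:
        raise ValueError("height must be positive")
    height_m = height_in * 0.0254        # inches -> metres
    weight_kg = weight_lb * 0.45359237   # pounds -> kilograms
    return weight_kg / height_m ** 2


# Hypothetical record, not taken from the SOCR data
example_bmi = bmi_from_imperial(68.0, 130.0)
print(round(example_bmi, 1))
```

Once you have a BMI column like this, you can treat it as a regression target or bin it into categories for a classification exercise.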

SOCR Data Dinov 020108 HeightsWeights Dataset Official Page

 

Titanic Dataset

You have probably seen the movie Titanic, about the ship that sank on 15 April 1912, killing 1502 of the 2224 people on board. The dataset has information such as name, age, sex, and number of siblings for each passenger, for both the training and the test set.
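Before training anything on this data, it helps to know what a trivial model would score. The sketch below only uses the two figures quoted above (1502 deaths out of 2224 people). Keep in mind that the downloadable training file is a sample, so its own survival ratio may differ.

```python
total_on_board = 2224
deaths = 1502
survivors = total_on_board - deaths

death_rate = deaths / total_on_board
survival_rate = survivors / total_on_board

# A "model" that always predicts the more common outcome is right this often.
# Any real classifier should beat this number to be worth keeping.
majority_baseline_accuracy = max(death_rate, survival_rate)

print(f"survival rate: {survival_rate:.1%}")
print(f"majority-class baseline: {majority_baseline_accuracy:.1%}")
```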

Titanic Dataset from Stanford Official Website

 

Boston Housing Dataset  (public datasets for machine learning)

This dataset contains housing prices for the Boston area based on features such as crime rate, number of rooms, and taxes. It has 506 rows and 14 variables (columns). The Boston housing dataset is generally used for pattern recognition. You can use it to build a linear regression model that predicts house prices.
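To make the linear regression idea concrete, here is a minimal sketch of a one-feature least-squares fit in plain Python. The numbers are toy values I made up (rooms versus price), not rows from the Boston dataset, and the function name is only illustrative. For the real dataset, with its 13 input features, you would normally reach for a library instead.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up toy data: average number of rooms and a price, NOT real Boston rows
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_simple_linear_regression(rooms, price)
predicted_price = slope * 6.5 + intercept  # prediction for a 6.5-room house
print(slope, intercept, predicted_price)
```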

Boston Housing Dataset Official Website

2. Public Government Datasets for Machine Learning

data.gov –

This is a general-purpose open data portal run by the US government. It has datasets in categories such as agriculture, climate, ecosystems, and energy. At the time of writing, the data.gov portal lists 190,277 datasets.


Google BigQuery Public Datasets

Google provides Google Cloud, which you can use as the infrastructure for your machine learning project. Along with it, Google makes a number of datasets publicly available under the name Google BigQuery Public Datasets.


Google Trends Dataset

Who doesn’t know about Google Trends? It shows you the current trend for a particular search term. You can download the data as an Excel or CSV file and play with it. The datasets are curated by the News Lab team at Google.


 

3. Finance & Economics Datasets

 World Bank DataSets-

You probably know how useful World Bank data is. The World Bank regularly publishes international data on poverty and other indicators. Using this portal you can get datasets for machine learning and statistics projects. Because the data comes straight from the World Bank, the portal also offers many filters, such as regions and countries, data type, and so on.


Quandl- 

Quandl is a finance-focused data source. Its data is clean, so it is used mostly by industry professionals. Data scientists working in investment banking and at hedge funds build recommendation systems on top of this data.


International Monetary Fund (IMF)  Dataset 

It lets you access and download finance-related data for free. It has datasets on international finances, debt, bonds, foreign exchange reserves, investments, commodities, credit, etc.

International monetary fund official website

Financial Times Market Dataset 

It covers data for stock markets, indices, bonds, and foreign exchange markets from around the world.

Financial Times Market Datasets Official Page

American Economic Association (AEA)

The AEA provides macroeconomic data such as inflation, GDP, and CPI for the United States.

Official AEA website

Eurostat Comext Data

It gives you data on commodity trade flows since 1998. It is maintained by the European Union.

Eurostat Comext Official Website

4. Image Datasets

Five Thirty Eight Datasets (Github Repo)-

This is the GitHub repository where FiveThirtyEight maintains its datasets along with their sources. Here is the official website for the FiveThirtyEight datasets. You could also call it a data story repo.


 

The MNIST dataset –

A very popular but very specific dataset. It contains data for image recognition and is famous for its handwritten digits. It has 60,000 instances for training and 10,000 for testing.
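MNIST is commonly distributed as IDX files: a big-endian header (a magic number, which is 2051 for image files, followed by the image count, rows, and columns as 32-bit integers) and then the raw pixel bytes. The sketch below parses that layout from a bytes buffer. The function name is my own, and the buffer here is synthetic data built in memory, not real MNIST digits. Downloaded files are usually gzip-compressed, so you would typically read them through `gzip.open` first.

```python
import struct

IMAGE_MAGIC = 2051  # magic number of IDX image files (0x00000803)


def parse_idx_images(raw: bytes):
    """Parse an IDX image buffer into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data does not match the header")
    return [
        [list(pixels[i * size + r * cols : i * size + (r + 1) * cols]) for r in range(rows)]
        for i in range(count)
    ]


# Synthetic buffer with two 2x2 "images" (not real MNIST data)
fake = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
images = parse_idx_images(fake)
print(len(images), images[0])
```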


The Chars74K dataset

Character recognition is one of the interesting problem areas in computer vision and classification. Chars74K provides a large labeled dataset for character recognition.


ImageNet dataset

ImageNet is a large database of images organized according to the WordNet hierarchy. WordNet has more than 100,000 phrases, and ImageNet aims to provide about 1000 images on average for each phrase, which makes it a 150 GB+ image database. Using this dataset you can build projects such as image recognition, face recognition, and object detection, and it works well with CNN (convolutional neural network) models.


Dogs Breed Dataset

If you want to build projects on dog breed classification, this dataset is for you. It was created at Stanford and contains images of 120 dog breeds from around the world.

Stanford Dogs Dataset Official Page

5. Entertainment Dataset

YouTube Dataset-

If you want to work on a video classification problem and are looking for a video dataset, here is good news for you. A Google research group has launched a labeled dataset of 8M classified YouTube videos.


MovieLens-

If you want to build a movie recommendation system based on end-user behavior and preferences, the MovieLens dataset is a good fit.
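To give a feel for what such a recommendation system does, here is a minimal sketch of user-based collaborative filtering with cosine similarity. The ratings dictionary is invented for illustration and is not MovieLens data, and the function names are my own. Cosine similarity over only the shared movies is a deliberately simple choice; real systems usually normalize ratings and handle sparse data more carefully.

```python
from math import sqrt

# Toy ratings {user: {movie: rating}}, invented for illustration (not MovieLens rows)
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 4, "Heat": 5, "Jaws": 5},
    "cat": {"Up": 5, "Coco": 4, "Alien": 1},
}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user: str, k: int = 1):
    """Rank movies the user has not seen by similarity-weighted ratings of other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, rating in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("ann"))
```

Here "ann" rates movies much like "bob" does, so the movie only "bob" has seen ends up at the top of her list.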


Jester-  

Just as MovieLens is a movie dataset, Jester is a jokes dataset. It is mainly used for building joke recommendation systems. Check it out first if you want to build something funny with machine learning.


6. Natural Language Processing (NLP) Datasets

Spam -SMS classifier Datasets –

It contains text classification datasets. I recommend it if you are doing your first text analytics machine learning project.


7. Sentiment Analysis Datasets

Twitter sentiment Analysis Datasets-

This dataset contains tweets classified by sentiment. Each row is a tweet and the target is its sentiment: one means positive sentiment and zero means negative sentiment. As you may already know, sentiment analysis is widely used in the NLP industry.
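A sensible first step with a file like this is to check how balanced the two classes are. The sketch below does that with Python's built-in `csv` module. The column names `sentiment` and `text` are assumptions on my part, and the three rows are a stand-in for the real download, so adjust both to the actual file.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded file; column names and rows are assumed, not real data
sample_csv = io.StringIO(
    "sentiment,text\n"
    "1,I love this phone\n"
    "0,Worst service ever\n"
    "1,What a great day\n"
)


def label_counts(handle) -> Counter:
    """Count rows per sentiment label (1 = positive, 0 = negative)."""
    reader = csv.DictReader(handle)
    return Counter(int(row["sentiment"]) for row in reader)


counts = label_counts(sample_csv)
total = sum(counts.values())
print({label: n / total for label, n in counts.items()})
```

If one label heavily outnumbers the other, plain accuracy becomes a misleading metric, which is worth knowing before you train a model.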

Other Top Machine Learning Datasets-

Frankly speaking, it is not possible to cover every machine learning dataset in detail in a single article, so I decided to give quick links for the rest. These are also top machine learning datasets. Have a look at them first:

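1. Swedish Auto Insurance Dataset (http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/slr06.html)

2. Awesome Public Datasets

3. Reddit dataset

4. Enron Email Dataset

5. Chatbot dataset

6. Flickr Dataset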

Must Read this Section (interesting machine learning datasets)  –

The most important thing to keep in mind while using these datasets is the license. Yes, the license! Most of the machine learning dataset repositories mentioned above are free. Still, there can be conditions hidden in the fine print. Guess what? Usually, things are open for non-commercial use only. When you are building a product or service and charging end users, things are different.

First, I recommend making a habit of reading the licenses of all the dependencies and external files that you use in your product. Otherwise, you could be sued. So be careful!

Conclusion (dataset for ml project) –

In conclusion, I agree that when you work in the analytics industry for a particular company, you mostly build predictive models or other systems for that company, and you use its own data. For example, if you work for Amazon and need to build a recommendation engine, you will use Amazon's data. But that is a very specific case. Let’s talk more generally. Suppose you are a student or a machine learning researcher, or you want to build or test something on dummy data. That is when you need these data sources.

Most importantly, for a beginner in data science, the UCI Machine Learning Repository and Kaggle are sufficient. So friends, I have mentioned most of the important and useful dataset sources for you. If you think anything is missing, please comment below.

 


Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and text analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 