Are you looking to build a machine learning or AI-based intelligent app? You will need a large amount of data to train your model. Machine learning projects mostly fail not because of the model or the infrastructure but because of poor datasets. Beginners who have just started with data science, in particular, waste a lot of time searching for the best datasets for machine learning projects. To help them out and save their valuable time, we have put together this article, which collects links to data sources where you can download datasets and start a machine learning project.
Even if you are not a beginner, I strongly recommend reading it in full. Why? If you are associated with the analytics industry in any way, you will need these datasets. If you do not want to read it fully right now, you may bookmark it. As a data scientist, I always bookmark evergreen articles related to the analytics industry so that I can come back to them at any time.
Secondly, the most important thing here is the variety of the dataset collection.
Dataset repositories for machine learning and statistics projects-
Here is the list of data sources. Note that every dataset has its own properties and specification, so you need to keep track of them.
1. Open Dataset For Machine Learning-
First, we will cover open-domain repositories of the best public datasets for machine learning and data science.
This repository contains data from various domains. For example, UCI contains datasets ranging from car evaluation to credit approval. At the time of writing, UCI hosts 433 datasets. You will find variety in dataset design: some are labeled (classification), some are for clustering, and so on. What I like most about this repository is the website navigation. If you open the website, you will see on the left many parameters by which you can filter the datasets.
I sometimes think of Kaggle as a complete platform for data science. On Kaggle you get datasets, kernels, and a team for discussion. You can also create your own dataset and share it with the community. The best part of Kaggle is that you get not only traditional data but also interesting datasets, some of them based on movies, such as Titanic.
In data science, a data scientist usually must understand the dataset deeply. If you are a beginner and are handed a completely unfamiliar domain and dataset to learn from, that is going to be a big challenge. On Kaggle you will find datasets about which you already have prior knowledge. Once you have learned the data science techniques, you can switch to any other domain.
Amazon also provides a wide range of machine learning datasets. You can use and analyze them on your local computer or on cloud services provided by AWS. To make things easier for beginners, AWS provides “how-to articles” with examples for every operation related to the datasets.
If you want to build machine learning projects on the Body Mass Index (BMI), this dataset can be useful. It has 25,000 records of people's weights paired with their heights.
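For readers new to the metric, BMI is weight in kilograms divided by the square of height in metres. Here is a minimal Python sketch of that formula. The function name `bmi` and the sample values are illustrative assumptions, and the dataset's actual column names and units should be checked before applying it.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return Body Mass Index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical record, not taken from the dataset.
print(round(bmi(70.0, 1.75), 1))
```

If the records are stored in other units, such as pounds and inches, convert them to kilograms and metres first.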
You must have seen the movie Titanic, about the ship that sank on 15 April 1912, killing 1,502 of its 2,224 passengers. The dataset has information such as name, age, sex, number of siblings, etc. for both the training and test sets.
This dataset contains housing prices for the city of Boston based on features such as crime rate, number of rooms, taxes, etc. It has 506 rows and 14 variables (columns). The Boston housing dataset is generally used for pattern recognition. You can use it to build a linear regression model that predicts house prices.
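To make the linear regression idea concrete, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python. The helper `fit_line` and the toy numbers are assumptions made up for illustration. They are not taken from the Boston data, and a real project would normally use all the features and a library such as scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data (NOT from the Boston dataset): number of rooms vs. price.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [15.0, 20.0, 25.0, 30.0, 35.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept  # predicted price for a 6.5-room house
print(slope, intercept, predicted)
```

The same idea extends to several features (crime rate, taxes, and so on), where the fitted line becomes a fitted plane.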
2. Public Government Datasets for Machine Learning
A general-purpose portal run by the US government. It has datasets in categories such as agriculture, climate, ecosystems, energy, etc. At the time of writing, the data.gov portal has 190,277 datasets.
Google provides Google Cloud, which you can use as infrastructure for your machine learning project. Along with it, Google provides some publicly available datasets under the name Google BigQuery Public Datasets.
Who doesn’t know about Google Trends? It shows you the current trend for a particular search term. You can download the data as an Excel or CSV file and play with it. It is curated by the News Lab team at Google.
3. Finance & Economics Datasets
You probably know how useful World Bank data is. The World Bank periodically publishes international data on poverty and other indices. Using this portal you can get datasets for machine learning and statistics projects. Because the data provider is the World Bank, the portal also offers many filters, such as Regions and Countries, Data Type, etc.
It is a finance-focused dataset. It is clean, so it is mostly used by industry professionals. Data scientists working for investment banks and hedge funds build recommendation systems on top of this dataset.
It lets you access and download finance-related data for free. It has datasets on international finance, debt, bonds, foreign exchange reserves, investments, commodities, credit, etc.
It covers the data for the stock markets, indices, bonds, and foreign exchange markets of the entire world.
The AEA dataset provides macroeconomic data such as inflation, GDP, CPI, etc. for the United States.
It gives you a dataset of commodity trade flows since 1998. It is maintained by the European Union.
4. Image Datasets
This is a GitHub repository where the FiveThirtyEight datasets are maintained along with their sources. Here is the official website for FiveThirtyEight datasets. You could also call it a data story repo.
A very popular but very specific dataset. It mainly contains data for image recognition. The MNIST dataset is famous for its handwritten digits. It contains 60,000 training instances and 10,000 test instances of handwritten digits.
Character recognition is one of the interesting problem areas in computer vision and classification. Chars74K contains a large labeled dataset for character recognition.
ImageNet is a large database of images organized according to the WordNet hierarchy. Currently, it has more than 100,000 phrases, and each phrase has 1,000 images, making it a 150 GB+ image database. Using this dataset you can build many projects, such as image recognition, face recognition, and object detection, and it works well with CNN (convolutional neural network) models.
If you want to build projects on dog classification, this dataset is for you. It was created by Stanford and contains images of 120 dog breeds from around the world.
5. Entertainment Dataset
If you want to work on a video classification problem and are looking for a video dataset, here is good news for you. The Google research group has recently launched a labeled dataset for 8M classified Yo
If you want to build a movie recommendation system based on client or end-user behavior and preferences, the MovieLens dataset is the best choice for you.
Just as MovieLens is a movie dataset, Jester is a jokes dataset. It is mainly used for building joke recommendation systems. Check it out first if you need to build something funny with machine learning.
6. Natural Language Processing (NLP) Datasets
It contains text classification datasets. I recommend using it if you are doing your first text analytics machine learning project.
7. Sentiment Analysis Datasets
This dataset contains tweets classified by sentiment. Each row is a tweet, and the target is its sentiment: 1 for positive sentiment and 0 for negative sentiment. As you may already know, sentiment analysis is widely used in the NLP industry.
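As a first sanity check on data of this shape, it helps to count how many rows carry each label. Here is a small sketch using only Python's standard library. The column names `sentiment` and `text` and the sample rows are hypothetical, so adjust them to match the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape described above; real files may differ.
SAMPLE = """sentiment,text
1,I love this phone
0,Worst service ever
1,What a great day
"""


def count_sentiments(csv_file):
    """Count rows per label, where 1 = positive and 0 = negative."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = int(row["sentiment"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        counts["positive" if label == 1 else "negative"] += 1
    return counts


counts = count_sentiments(io.StringIO(SAMPLE))
print(counts["positive"], counts["negative"])
```

A heavily unbalanced count is worth knowing about before training, because a classifier can then score well simply by predicting the majority label.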
Other Top Machine Learning Datasets-
Frankly speaking, it is not possible to cover every machine learning dataset in detail in a single article. Therefore, I decided to give quick links to them. These are the top machine learning datasets. Have a look at them first –
Must Read this Section (interesting machine learning datasets) –
The most important thing to keep in mind while using these datasets is the license. Yes, the license! Most of the machine learning dataset repositories mentioned above are free. Still, there can be conditions hidden in the fine print. Guess what? Usually, things are open for non-commercial use. When you are building a product or service and charging end users, things are different.
First, I recommend making a habit of reading the licenses of all the dependencies and external files that you use in your product. Otherwise, you could be sued. So be careful!
Conclusion (dataset for ml project) –
In conclusion, I agree that when you work in the analytics industry for a particular company, you mostly build predictive models or other systems for that company, using its own data. For example, if you work for Amazon and need to build a recommendation engine, you will always use Amazon's data. But that is a very specific case. Let's talk more generally. Suppose you are a student or researcher in machine learning, or you want to build something or test something on dummy data. In that case, you need these data sources.
Most importantly, for a beginner in data science, the UCI Machine Learning Repository and Kaggle are sufficient. So friends, I have mentioned most of the important and useful dataset sources for you. If you think anything is missing, please comment below.