Are you looking to build an intelligent app based on machine learning and AI? You will need a large amount of data to train your model. Machine learning projects often fail not because of the model or the infrastructure but because of poor datasets. Beginners who have just started with data science, in particular, waste a lot of time searching for the best datasets for machine learning projects. To help them and save their valuable time, we have put together this article, which includes a chain of data source links from which you can download datasets and start a machine learning project.
Even if you are not a beginner, I strongly recommend reading it in full. Why? If you are associated with the analytics industry in any way, you will need these datasets. And if you do not want to read it all right now, you can bookmark it. As a data scientist, I always bookmark evergreen articles related to the analytics industry so that I can come back to them at any time.
Table of Contents
- 0.1 UCI Machine Learning Repository
- 0.2 Kaggle
- 0.3 AWS Datasets
- 0.4 data.gov
- 0.5 World Bank Datasets
- 0.6 FiveThirtyEight Datasets (GitHub Repo)
- 0.7 The MNIST Dataset
- 0.8 The Chars74K Dataset
- 0.9 Google BigQuery Public Datasets
- 0.10 YouTube Dataset
- 0.11 Spam SMS Classifier Datasets
- 0.12 Twitter Sentiment Analysis Datasets
- 0.13 MovieLens
- 0.14 Jester
- 0.15 Quandl
- 1 Other Useful Dataset Sources
- 2 Must Read This Section
Datasets for machine learning and statistics projects-
Here is the list of data sources. Most importantly, every dataset has its own properties and specifications, so you need to keep track of them.
UCI Machine Learning Repository –
This repository contains data from various domains. For example, its datasets range from car evaluation to credit approval. At the time of writing this article, UCI hosts 433 datasets across different domains. You will also find variety in how the datasets are designed: some are labeled (for classification), some are meant for clustering, and so on. What I like most about this repository is the website navigation. If you open the website, you will see many parameters on the left that you can use to filter the datasets.
Kaggle –
I sometimes think of Kaggle as a complete platform for data science. On Kaggle you get datasets, kernels, and a community for discussion. You can also create your own dataset and share it with the community. The best part of Kaggle is that you get not only traditional data but also genuinely interesting datasets, sometimes built around familiar stories such as the Titanic.
In data science, a data scientist is usually expected to understand the dataset deeply. If you are a beginner and you pick a completely unknown domain and dataset to learn from, that is going to be a big challenge. On Kaggle you can find datasets on topics you already know something about. Once you have learned the data science techniques, you can switch to any other domain.
AWS Datasets –
Amazon also provides a wide range of machine learning datasets. You can use and analyze them on your local computer or on the cloud services that AWS provides. To make things easier for beginners, AWS publishes how-to articles with examples for operations related to the datasets.
data.gov –
This is a general-purpose open data portal run by the US government. It has datasets in various categories such as agriculture, climate, ecosystems, and energy. At the time of writing this article, the data.gov portal has 190,277 datasets.
World Bank Datasets –
You probably know how useful World Bank data is. The World Bank regularly publishes international data on poverty and other indicators. You can use this portal to get datasets for machine learning and statistics projects. Because the data comes from the World Bank itself, the portal also offers many filters, such as regions and countries, data type, and so on.
FiveThirtyEight Datasets (GitHub Repo) –
This is a GitHub repository where the FiveThirtyEight datasets are maintained along with their sources. There is also an official website for the FiveThirtyEight datasets. You could call it a data-story repository.
The MNIST Dataset –
A very popular but very specific dataset. It contains data for image recognition and is famous for its handwritten digits. It has 60,000 instances in the training set and 10,000 in the test set, all of them handwritten digits.
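If you download MNIST in its original binary form, the files use the IDX layout: a big-endian header followed by raw bytes. Here is a minimal sketch of reading that layout, assuming you already have the file contents in memory. The function names are my own, and the tiny buffers at the bottom are synthetic stand-ins, not real MNIST data.

```python
import struct

IMAGE_MAGIC = 2051  # magic number of an IDX image file (0x00000803)
LABEL_MAGIC = 2049  # magic number of an IDX label file (0x00000801)


def parse_idx_images(raw):
    """Return (rows, cols, images); each image is a bytes object of rows*cols pixels."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected image magic number: {magic}")
    size = rows * cols
    pixels = raw[16:]
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match the header")
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return rows, cols, images


def parse_idx_labels(raw):
    """Return the labels as a list of ints (0-9 for MNIST)."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected label magic number: {magic}")
    labels = list(raw[8:])
    if len(labels) != count:
        raise ValueError("label data length does not match the header")
    return labels


# Synthetic stand-in: two 2x2 "images" and their labels (not real MNIST data).
raw_images = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 2) + bytes([0, 255, 255, 0, 10, 20, 30, 40])
raw_labels = struct.pack(">II", LABEL_MAGIC, 2) + bytes([7, 3])

rows, cols, images = parse_idx_images(raw_images)
labels = parse_idx_labels(raw_labels)
print(rows, cols, len(images), labels)
```

With the real files you would read the bytes from disk (after decompressing them if needed) and expect 60,000 training images and 10,000 test images of 28x28 pixels each.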
The Chars74K Dataset –
Character recognition is one of the interesting problem areas in computer vision and classification. Chars74K contains a large labeled dataset for character recognition.
Google BigQuery Public Datasets –
Google provides Google Cloud, which you can use as the infrastructure for your machine learning project. Along with it, Google offers some publicly available datasets under the name Google BigQuery Public Datasets.
YouTube Dataset –
If you want to work on a video classification problem and are looking for a video dataset, here is some good news for you. Google's research group has recently released a labeled dataset of 8 million classified YouTube videos.
Spam SMS Classifier Datasets –
It contains text classification datasets. I recommend it if you are doing your first text analytics machine learning project.
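To give a feel for what a first text classification project looks like, here is a minimal sketch of a naive Bayes spam filter with add-one smoothing, written in plain Python. The four training messages are made up for illustration and do not come from any real SMS corpus. A real project would load the downloaded dataset and most likely use a library instead.

```python
import math
from collections import Counter

# Hypothetical toy messages; a real SMS spam corpus would be loaded from a file.
TRAINING = [
    ("spam", "win a free prize now"),
    ("spam", "free cash claim now"),
    ("ham", "are we meeting for lunch"),
    ("ham", "see you at home tonight"),
]


def train(messages):
    """Count words per label, messages per label, and collect the vocabulary."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for label, text in messages:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(predict("claim your free prize", model))
print(predict("see you for lunch", model))
```

Splitting on whitespace is the simplest possible tokenizer. It is good enough to show the idea, but real messages need proper cleaning of punctuation and casing.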
Twitter Sentiment Analysis Datasets –
This dataset contains tweets classified by sentiment. Each row is a tweet, and the target is its sentiment: 1 means positive sentiment and 0 means negative sentiment. As you already know, sentiment analysis is widely used in the NLP industry.
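The row layout described above (one tweet per row, target 1 or 0) can be loaded and checked with a few lines of Python. This is only a sketch: the column names `target` and `text` and the three sample rows are assumptions for illustration, so adjust them to match the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the layout described above; the column names are assumed.
SAMPLE_CSV = """target,text
1,I love this phone
0,Worst service ever
1,Great game last night
"""


def load_tweets(handle):
    """Read (text, label) pairs, where label is 1 for positive and 0 for negative."""
    reader = csv.DictReader(handle)
    return [(row["text"], int(row["target"])) for row in reader]


tweets = load_tweets(io.StringIO(SAMPLE_CSV))

# Checking the class balance first is a cheap way to spot a skewed dataset.
balance = Counter(label for _, label in tweets)
print(balance)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.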
MovieLens –
If you want to build a movie recommendation system based on the behavior and preferences of clients or end users, the MovieLens dataset is a good choice for you.
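As a rough idea of what such a recommendation system does, here is a minimal sketch of user-based collaborative filtering with cosine similarity. The ratings below are hypothetical toy values in a (user, movie, rating) shape, not real MovieLens rows, and the function names are my own.

```python
import math
from collections import defaultdict

# Hypothetical toy ratings as (user, movie, rating); not real MovieLens rows.
RATINGS = [
    ("u1", "Movie A", 5.0), ("u1", "Movie B", 4.0),
    ("u2", "Movie A", 5.0), ("u2", "Movie B", 4.0), ("u2", "Movie C", 5.0),
    ("u3", "Movie A", 1.0), ("u3", "Movie D", 5.0),
]


def build_profiles(ratings):
    """Group ratings into {user: {movie: rating}}."""
    profiles = defaultdict(dict)
    for user, movie, rating in ratings:
        profiles[user][movie] = rating
    return profiles


def cosine(a, b):
    """Cosine similarity between two rating dicts (0.0 if no movie is shared)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, profiles, top_n=3):
    """Rank movies the user has not rated by similarity-weighted ratings of others."""
    scores = defaultdict(float)
    for other, their_ratings in profiles.items():
        if other == user:
            continue
        similarity = cosine(profiles[user], their_ratings)
        for movie, rating in their_ratings.items():
            if movie not in profiles[user]:
                scores[movie] += similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


profiles = build_profiles(RATINGS)
print(recommend("u1", profiles))
```

Here u1 rates movies much like u2 does, so the movie that u2 liked and u1 has not seen ranks first. On the real dataset you would want mean-centered ratings and a proper train/test split, which this sketch leaves out.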
Jester –
Just as MovieLens is a movie dataset, Jester is a jokes dataset. It is mainly used for building joke recommendation systems. Please check it out if you need to build something funny with machine learning.
Quandl –
It is a finance-focused data source. The data is clean, which is why industry professionals mostly use it. Data scientists working for investment banks and hedge funds build recommendation systems on top of this data.
Other Useful dataset sources –
Frankly speaking, it is not possible to cover the details of every machine learning dataset in a single article. Therefore I decided to give quick links to the rest. Please check them out –
Must Read this Section –
The most important thing to keep in mind while using these datasets is the license. Yes, the license! Most of the machine learning dataset repositories mentioned above are free. Still, there can be a catch. Guess what? Usually things are open only for non-commercial use. When you are building a product or service and charging end users, things are different.
To the best of my knowledge, it is a good habit to read the license of every dependency and external file that you use in your product. Otherwise you could face legal action. So be careful!
I also agree that when you work in the analytics industry for a particular company, you mostly build predictive models or something similar for its own systems, and you use its own data. For example, if you work for Amazon and need to build a recommendation engine, you will always use Amazon's data. Right? But this is a very specific case. Let's talk more generally. Suppose you are a student or a machine learning researcher, or you want to build something or test an idea on dummy data. Then you need these data sources.
For a beginner in data science, the UCI Machine Learning Repository and Kaggle are sufficient most of the time. So friends, I have mentioned most of the important and useful dataset sources for you. If you think anything is missing, please comment below.