Where can I find Data set for Machine Learning purposes?

I am currently doing research that attempts to detect early signs of the onset of mental disorders through facial recognition. To do that, I need a training dataset that contains (1) pictures and (2) diagnoses. Are there any open datasets that I could find or apply for? Thanks.

How can you use these sources?

There is no end to how you can use these data sources. Their application and usage are limited only by your creativity.

The simplest way to use them is to create data stories and publish them on the web. This would not only improve your data and visualization skills, but also improve your structured thinking.

On the other hand, if you are thinking / working on a data based product, these datasets could add power to your product by providing additional / new input data.

So, go ahead, work on these projects and share them with the larger world to showcase your data prowess!

I have divided these sources in various sections to help you categorize data sources based on application. We start with simple, generic and easy to handle datasets and then move to huge / industry relevant datasets. We then provide links to dataset for specific purpose – Text Mining, Image classification, Recommendation engine etc. This should provide you a holistic list of data resources.

If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.

Simple & Generic datasets to get you started

  • – This is the home of the U.S. Government’s open data. The site contains more than 190,000 data points at time of publishing. These datasets vary from data about climate, education, energy, Finance and many more areas.

  • – This is the home of the Indian Government’s open data. Find data by various industries, climate, health care etc. You can check out a few visualizations for inspiration here. Depending on your country of residence, you can also look for similar open data websites for a few other countries – check them out.
  • World Bank – The open data from the World Bank. The platform provides several tools like Open Data Catalog, world development indices, education indices etc.
  • RBI – Data available from the Reserve Bank of India. This includes several metrics on money market operations, balance of payments, use of banking and several products. A must-visit site if you come from the BFSI domain in India.
  • Five Thirty Eight Datasets – Here is a link to datasets used by Five Thirty Eight in their stories. Each dataset includes the data, a dictionary explaining the data and the link to the story carried out by Five Thirty Eight. If you want to learn how to create data stories, it can’t get better than this.

Huge Datasets – things are getting serious now!

  • Amazon Web Services (AWS) datasets – Amazon provides a few big datasets, which can be used on their platform or on your local computers. You can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Song dataset and many more. More information can be found here.
  • Google datasets – Google provides a few datasets as part of its BigQuery tool. This includes baby names, data from GitHub public repositories, all stories & comments from Hacker News etc.
  • YouTube labeled Video Dataset – A few months back, Google Research released the YouTube labeled dataset, which consists of 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.

Datasets for predictive modeling & machine learning:

  • UCI Machine Learning Repository – The UCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go if you are looking for datasets related to machine learning. It covers a diverse range, from popular datasets like Iris and Titanic survival to recent contributions like Air Quality and GPS trajectories. The repository contains more than 350 datasets with labels like domain and purpose of the problem (Classification / Regression). You can use these filters to identify good datasets for your need.
  • Kaggle – Kaggle has come up with a platform where people can donate datasets and other community members can vote and run Kernels / scripts on them. They have more than 350 datasets in total, with more than 200 as Featured datasets. While some of the initial datasets were already available elsewhere, I have seen a few interesting datasets on the platform that are not present at other places. Along with new datasets, another benefit of the interface is that you can see scripts and questions from community members in the same place.
  • Analytics Vidhya – You can participate in and download datasets from our practice problems and hackathon problems. The problem datasets are based on real-life industry problems and are relatively small, as they are meant for 2 to 7 day hackathons. While practice problems are always available, the hackathon problems become unavailable after the hackathons. So, you need to participate in the hackathon to get access to the datasets.
  • Quandl – Quandl provides financial, economic and alternative data from various sources through their website / API or direct integration with a few tools. Their datasets are classified as Open or Premium. You can access all the open datasets for free, but you need to pay for the premium datasets. If you search, you can still find good datasets on the platform. E.g., stock exchange data from India is available for free.
  • Past KDD Cups – KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. The archives include datasets and instructions. Winners are listed for most years.
  • Driven Data – Driven Data finds real-world challenges where data science can be used to create a positive social impact. They then run online modeling competitions for data scientists to develop the best models to solve them. If you are interested in the use of data science for social good, this is the place to be.

Image classification datasets

  • The MNIST Database – The most popular dataset for image recognition using hand-written digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. This is typically the first dataset used to practice image recognition.
  • Chars74K – Here is the next level of evolution, if you have passed hand-written digits. This dataset includes character recognition in natural images. The dataset contains 74,000 images and hence the name of the dataset.
  • Frontal Face Images – If you have worked on the previous 2 projects and are able to identify digits and characters, here is the next level of challenge in image recognition: frontal face images. The images were collected by CMU & MIT and are arranged in four folders.
    Time to build something generic now. Image database organised according to the WordNet hierarchy (currently only the nouns). Each node of the hierarchy is depicted by hundreds of images. Currently, the collection has an average of over five hundred images per node (and increasing).

Text Classification datasets

  • Twitter Sentiment Analysis – The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets. Each row is marked as 1 for positive sentiment and 0 for negative sentiment. The data is in turn based on a Kaggle competition and analysis by Nick Sanders.

Datasets for Recommendation Engine

  • MovieLens – MovieLens is a web site that helps people find movies to watch. It has hundreds of thousands of registered users. They conduct online field experiments in MovieLens in the areas of automated content recommendation, recommendation interfaces, tagging-based recommenders. These datasets are available for download and can be used to create your own recommender systems.

Websites which curate lists of datasets from various sources:

  • KDNuggets – The dataset page on KDNuggets has long been a reference point for people looking for datasets out there. A really comprehensive list, however some of the sources no longer provide the datasets. So, you will need to apply your own prudence on the datasets and the sources.
  • Awesome Public Datasets – A GitHub repository with a comprehensive list of datasets categorized by domain. Datasets are classified neatly in various domains, which is very helpful. However, there is no description of the datasets in the repository itself, which would have made it even more useful.
    Since this is a community driven forum, it might come across a bit messy (compared to previous 2 sources). However, you can sort datasets by popularity / votes to see the most popular ones. Also, it has some interesting datasets and discussions.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

  • You can download them fast.
  • You can fit them into memory easily.
  • You can run algorithms on them quickly.

Access Standard Datasets in R

You can load the standard datasets into R as CSV files.

There is a more convenient approach to loading the standard dataset. They have been packaged and are available in third party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and what datasets are good to start with?

Interesting Data Sets

If, tomorrow, you get an email congratulating you on your new status as future Jeopardy contestant, how are you going to prepare? Well, one approach might be to download this archive of 216,930 past Jeopardy questions and plug them into your favorite spaced repetition system. Combine that with reading up on Jeopardy betting strategies, and you’re well on your way to becoming the next Arthur Chu (except hopefully nicer).

Ever get a morbid curiosity about what it’s like to be on death row? (Yeah, me neither.) But in case you ever have, Texas has graciously placed the last words of every inmate executed since 1984 online. So… sentiment analysis, anyone? (“How upbeat are death row inmates days before execution? With a little help from some data, we found out!”)

Speaking of prison, there’s more data on prisoners, including information about “their current offense and sentence, criminal history, family background and personal characteristics, prior drug and alcohol use and treatment programs, gun possession and use, and prison activities, programs, and services” available here.

How about reading other people’s emails? Ever wanted to do that, but can’t be bothered to train l33t hacking skills (and never mind the legality of it)? (Okay, this one I have thought about.) Well, I’ve got you covered. Check out the Enron corpus. It contains more than half a million emails from about 150 users, mostly senior management of Enron, organized into folders. Wikipedia calls it “unique in that it is one of the only publicly available mass collections of ‘real’ emails easily available for study.” Business idea: figure out what sort of information gets leaked in the emails that will later harm the execs at trial or whatever, then build a software system to automatically mine those out of real email. Either sell it to law enforcement or to corporate executives as the finest cover-your-ass email system.

Wondering what the internet really cares about? Well, I don’t know about that, but you could answer an easier question: What does Reddit care about? Someone has scraped the top 2.5 million Reddit posts and then placed them on GitHub. Now you can figure out (with data!) just how much Redditors love cats. Or how about a data backed equivalent of r/circlejerk? (The original use case was determining what domains are the most popular.)

Speaking of cats, here are 10,000 annotated images of cats. This ought to come in handy whenever I get around to training a robot to exterminate all non-cat lifeforms. (Or, if you’re Google, you could just train a cat recognition algorithm and then send those users cat-specific advertising.)

If you’re interested in building financial algorithms or, really, just predicting arbitrage opportunities for one of America’s largest cash crops, check out this data set, which tracks the price of marijuana from September 2nd, 2010 until about the present.

The earliest recorded chess match dates back to the 10th century, played between a historian from Baghdad and a student. Since then, it’s become a tradition for moves to be recorded – especially if a game has some significance, like a showdown between two strong players. As a consequence, today, students of the game benefit from one of the richest data sets of any game or sport. Perhaps the best freely available data set of games is known as the “Million Base,” boasting some 2.2 million matches. You can download it here. I can imagine an app that calculates your chess fingerprint, letting you know what grandmaster your play is most similar to, or an analysis of how play style has changed over time.

On the topic of games, for soccer fans, I recently came across this freely available data set of soccer games, players, teams, goals, and more. If that’s not enough, you can grab even more data via this Soccermetrics API python wrapper. I imagine that this could come in handy for coaches attempting to get an edge over opponent teams and, more generally, for that cross-section between geeks and gamblers attempting to build analytic models to make better bets.

Google has made all their Google Books n-gram data freely available. An n-gram is an n-word phrase, and the data set includes 1-grams through 5-grams. The data set is “based originally on 5.2 million books published between 1500 and 2008.” I can imagine using it to determine the most overused, cliche phrases, and those phrases that are in danger of becoming cliched. (Quick! Someone register the domain!)
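To make the n-gram idea concrete, here is a minimal sketch of pulling 1-grams through 5-grams out of a piece of text and counting them. The function names and the lowercase, whitespace-based tokenization are my own assumptions for illustration; the Google Books data was built with its own tokenization rules, which this does not try to reproduce.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-word phrases in text, in order, as tuples of words."""
    # Assumption: lowercase and split on whitespace; real corpora tokenize more carefully.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(text, max_n=5):
    """Count every 1-gram through max_n-gram in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts


sample = "to be or not to be"
counts = ngram_counts(sample)
# Phrases that occur more than once are the first candidates for "overused".
repeated = [phrase for phrase, count in counts.items() if count > 1]
print(repeated)
```

On a real corpus you would compare counts across publication years rather than within one sentence, but the counting step looks the same.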

Amazon has a number of freely available data sets (although I think you need to run your analysis on top of their cloud, AWS), including more than 2.8 billion webpages courtesy of Common Crawl. The possibilities are endless, but an old business idea I had: analyze the Common Crawl data and determine cheap or not-currently-registered domains which are, for whatever reason, linked to by many websites. Buy these up and then resell them to people involved in SEO. (Or you could, you know, try to build the next Google.)

How well do minorities do on the computer science advanced placement exam? You can find out and tell me.

There’s the Million Song data set, which contains information about a million different songs, including a metric “danceability.” Might be nice to pair that with a media player specialized for parties — start with “conversation” music, and slowly shift to more danceable stuff as the night drags on. The data could also be used for a clustering algorithm (automatic genre detection, maybe), but I’m not sure how useful that’d be. A number of people have tried to build recommendation algorithms based on the data, including Kagglers and a team from Cornell. One possible use: analyzing music by year — How danceable, fast, etc. were the 70s? 80s? 90s? (Or how about looking for a follow-the-leader effect. If one song goes viral with a unique style, do a bunch of copycats follow?)

Speaking of music data sets, has music data available. Collected from 360,000 users, it’s in the form of “user, artists, # of plays”. This would be good for clustering algorithms that automatically determine label genre or recommender systems. (Even a “this artist is most similar to” thing would be sorta cool.)
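As a rough sketch of the “this artist is most similar to” idea, one common approach is to treat each artist as a vector of per-user play counts and compare artists by cosine similarity. The rows below are made up for illustration and only mimic the “user, artist, plays” shape; the helper names are hypothetical, and cosine similarity is one choice among many.

```python
import math
from collections import defaultdict

# Made-up rows in the "user, artist, number of plays" shape (not real data).
plays = [
    ("u1", "Artist A", 10), ("u1", "Artist B", 8),
    ("u2", "Artist A", 5), ("u2", "Artist B", 4),
    ("u3", "Artist C", 7),
]


def artist_vectors(rows):
    """Map each artist to a {user: total plays} dictionary."""
    vectors = defaultdict(dict)
    for user, artist, count in rows:
        vectors[artist][user] = vectors[artist].get(user, 0) + count
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {user: plays} vectors."""
    dot = sum(a[user] * b[user] for user in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(artist, vectors):
    """Return the name of the other artist with the highest cosine similarity."""
    scored = [(cosine(vectors[artist], vec), name)
              for name, vec in vectors.items() if name != artist]
    return max(scored)[1]


vectors = artist_vectors(plays)
print(most_similar("Artist A", vectors))
```

With 360,000 users you would want sparse matrices rather than dictionaries, but the similarity measure itself would not change.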

When I think geeks, I think math and computer geeks, but there are many more. Terry Pratchett geeks (dated one!), Whovians, anime geeks, theater geeks and, with some relevance to this next data set, comic book geeks. Cesc Rosselló, Ricardo Alberich, and Joe Miro have put together a “social graph” of the Marvel Universe, and the data is freely available. Ideas for use: Maybe it could be overlaid on Facebook’s social graph to produce a new take on the “What superhero are you?” quiz.

Yelp has a freely available subset of their data, including restaurant rankings and reviews. One business idea: use tweets to predict restaurant star ratings. This would enable you to build out a Yelp competitor without requiring an active user base — you could just mine Twitter for data!

If you’re interested in data about data (metadata!), Jürgen Schwärzler, a statistician from Google’s public data team, has put together a list of the most frequently searched for data. The top 5 are school comparisons, unemployment, population, sales tax, and salaries. I was surprised that school comparisons were number 1 but, then again, I don’t have any brats running around (yet?). This list would be a good first step in researching what sort of data comparisons people actually care about.

Some of my readers are, no doubt, evil geniuses. Others want to save the world. There’s a subset of both of these groups who are interested in superintelligent robots. But to build such a robot, you’re going to have to teach it facts. All the things we take for granted, like that every person has one father. It would be a pain to insert those 10 million facts by hand (and, at a fact a minute, take more than 19 years). Thankfully, Freebase has done part of the job for you, making more than 1.9 billion facts freely available.
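For the skeptical, the “more than 19 years” figure can be checked in a few lines. This sketch assumes you type one fact every minute around the clock, with no sleep and no leap years.

```python
FACTS = 10_000_000
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes, ignoring leap years

# One fact per minute, nonstop.
years = FACTS / MINUTES_PER_YEAR
print(f"{years:.2f} years")  # prints "19.03 years"
```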

Maybe your plans are slightly less ambitious. You don’t want to build a superintelligent machine, just one smarter than your run of the mill mathematician. If that’s the case, you’re going to need to teach your machine a lot about mathematics, probably in the form of proofs and theorems. In that case, check out the Mizar project, which has formalized more than 9400 definitions and 49000 theorems.

And let’s say you build this mathematician and, sure, it can help you with proofs, but so what? You long for someone you can connect with on a deeper level. Someone who can summarize any topic imaginable. In that case, you might want to feed your robot on Wikipedia data. While all of Wikipedia is freely available, DBpedia is an attempt to synthesize it into a more structured format.

Now, you get tired of mathematics and Wikipedia. It turns out that proofs don’t pay the bills, so instead you decide to become a software engineer. Somehow, though, you’ve managed to build these machines without even a rudimentary understanding of programming, and you want a machine that will teach it to you. But where to find the data for such a thing? You might start with downloading all 7.3 million StackOverflow questions. (Actually, all the StackExchange data is freely available, so you could feed it more math information from both MathOverflow and the other math stackexchange. Plus statistics from Cross Validated, and so on.)

Ever wanted to study true friendship? (C’mon! Free your inner child social scientist.) Y’know, genuine, platonic love, like the kind embodied by dolphins? Well, now you can! All thanks to your humble author and Mark Newman, who’s placed a network of “frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand.” Business idea: Flippr. It’s like Facebook, but for dolphins, with plans to expand into emerging whale and sea turtle markets. Most revenue will come from sardine sales.

Do left-leaning blogs more often link to other left-leaning blogs than right-leaning ones? Well, I don’t know, but it sounds reasonable. And, thanks to permission from Lada Adamic, you can download her network of hyperlinks between weblogs on US politics, recorded in 2005. (Or you could just read her paper. Spoilers: conservatives more freely link to other conservatives than liberals link to liberals so, if you’re interested in link building, maybe you should register Republican.) 1
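If you want to check the left/right linking question yourself, one simple measure is the share of a side's outgoing links that land on blogs of the same side. The sketch below uses a toy network I made up; the blog names, labels, and function name are all hypothetical and are not taken from the actual data set or paper.

```python
# Toy data for illustration only: blog -> political leaning.
leaning = {"a": "left", "b": "left", "c": "right", "d": "right"}

# Directed hyperlinks as (source blog, target blog) pairs.
links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]


def same_side_share(links, leaning, side):
    """Fraction of links from blogs on `side` that point to blogs on the same side."""
    outgoing = [(src, dst) for src, dst in links if leaning[src] == side]
    if not outgoing:
        return None  # no outgoing links, so the share is undefined
    same = sum(1 for _, dst in outgoing if leaning[dst] == side)
    return same / len(outgoing)


left_share = same_side_share(links, leaning, "left")
right_share = same_side_share(links, leaning, "right")
print(left_share, right_share)
```

A fair comparison on real data would also account for how many blogs each side has, since a bigger side attracts more links by chance alone.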

Who’s friendlier: the average jazz musician or the average dolphin? You could find out by combining the dolphin data set mentioned earlier with Pablo M. Gleiser and Leon Danon’s jazz musicians network data set.

What about 1930s southern women or prisoners? Who’s friendlier? How about fraternity members or HAM radio operators? All this and more can be figured out with these network data sets.

Web 2.0 websites (like Reddit) are sometimes gamed by “voting rings,” which are groups of people that intentionally vote up each other’s content, regardless of quality. I’ve often wondered if the same thing happens in academic circles. Like, you know, one night during your first year in grad school, you’re kidnapped in the middle of the night and made to swear a blood oath that you’ll cite every other member of the club. Or something. Well, Stanford has put online Arxiv’s High Energy Physics paper citation network, so you could find out.

You read this blog, so you’re pretty smart, right? And maybe you’d like to be rich, you know, so you can found the next Bill and Melinda Gates Foundation and save the world. (Because that’s why you want to be rich, right?) Well, then maybe you ought to develop some new-fangled trading algorithm and pick up like a trillion pennies from in front of the metaphorical steam-roller that is the market. (Quantitative finance!) But, in such a case, you’d better at least test your strategy on historical market data. Market data which you can get here.

The Open Product Data website aims to make barcode data available for every brand for free. Business idea: a specialty tattoo parlor that only does barcode tattoos, but lets customers pick whatever product they want. Think about it: “What’s your tattoo mean?” “It’s a Twinkie barcode, because Twinkies last forever, man, just like my faith.”

The European Center for Medium-Range Weather Forecasts has an impressive looking collection of weather data. Why, you ask, does the weather matter? The economic incentives for predicting the weather are absurd. When should you plant crops? Plan a big event? Launch a space shuttle? Go deep sea fishing? But I want to talk about the most fun application of weather data I’m aware of: The financial industry. I have a lot of respect for finance, mostly because of the crazy stuff they do. The only practical application of neutrinos I’ve heard of, for instance, is “because finance.” Should your algorithm buy Indonesian sesame seed futures? With weather data, it might know.

If you need nutrition data about food, the USDA has you covered. Business idea: A phone application called, “Am I allergic to that?” Then, lobby for your state to pass a law requiring each school to buy a license for every student.

For a wordsmith, a good dictionary is indispensable, and when it comes to word data, you could do a lot worse than check out the freely available WordNet. WordNet has significant advantages over your run of the mill dictionary as it focuses on the structure of language, grouping words into “sets of cognitive synonyms (synsets), each expressing a distinct concept.” It also has some information about relationships, such as “a chair has legs.”

We’ve already established that some of you are evil geniuses, in which case, where are you going to build your secret lair? I mean, a volcano is pretty cool, but is it evil and genius enough for competing in today’s modern world? You know what the other evil geniuses don’t have? A secret base on a planet outside of the solar system. With NASA’s list, you can get busy commissioning someone to build you a base on KOI-3284.01. 2

The Federal Railroad Administration keeps a list of “railroad safety information including accidents and incidents, inventory and highway-rail crossing data.” Someone (like the NY Times) could overlay this on a map of the United States and figure out if people in poor regions are more likely to be hit by trains or something.

If you need a database of comprehensive book data, perhaps to build a competitor to Goodreads or an online digital library, the Open Library allows people to freely download their entire database.

Who is the United States killing with drones? If you’re content with Pakistan specific data, there is a list of drone strikes available here.

If you’re interested in building a Papers2 competitor with support for automatically importing citation data (please do this), CrossRef metadata search might be a good place to check out.

Mnemosyne is a virtual flash card program that takes advantage of spaced repetition to maximize learning. (As you might recall, I’m a big fan of spaced repetition.) The project has been collecting user data for years, and gwern has graciously agreed to freely host the data for a few months. Perhaps one could run some sort of unsupervised learning algorithm over it and try to discover heretofore unknown information about human memory.

How much would it cost to hire Justin Bieber to play at your wedding? The fine lads at Priceonomics have figured out how much it would cost to hire your favorite band. You could take this data and calculate some sort of popularity to price ratio: what’s the most fame for your buck?

I’ve mentioned in a few of the other data sets just how lucrative it is to be able to better predict the stock market than everyone else. In 2011, researchers found that they could use data from Twitter to do just that: they went through tweets, found ones related to publicly traded companies, and then calculated a mood score. With this they write, “We find an accuracy of 86.7% in predicting the daily up and down changes in the closing values of the DJIA.” A number of Twitter data sets are freely available here.
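An accuracy figure like that is a directional accuracy: the fraction of days on which the predicted direction (up or down) matched the actual change in the closing value. Here is a minimal sketch of how such a score could be computed. The closing values and predictions are made up, the function names are mine, and this is not the researchers' actual evaluation code.

```python
def daily_directions(closes):
    """True for each day that closed higher than the previous day, else False."""
    return [today > yesterday for yesterday, today in zip(closes, closes[1:])]


def directional_accuracy(predicted, closes):
    """Fraction of days where the predicted up/down call matched the actual move."""
    actual = daily_directions(closes)
    if not actual:
        raise ValueError("need at least two closing values")
    if len(predicted) != len(actual):
        raise ValueError("need one prediction per day-over-day change")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Made-up closing values and made-up "mood score says up" predictions.
closes = [100.0, 101.5, 101.0, 102.0, 101.2]
predicted = [True, False, False, False]
accuracy = directional_accuracy(predicted, closes)
print(accuracy)
```

Before getting excited about any such number, compare it to a dumb baseline like always guessing “up”.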

A 2014 paper by Clifford Winston and Fred Mannering reports that vehicle traffic costs the United States 100 billion dollars each year. 3 There’s money to be made, then, in routing traffic more efficiently. One way to do this would be to feed an algorithm historical traffic data and then use that to predict hotspots, which you would route people around. Lots of that data is available on

On the other hand, if you were building an app to track current traffic data, you’ll need a different data source.

If you want to launch a spam-fighting service, or maybe just analyze what type of emails spammers are sending, you’ll need data. UC Irvine has you covered.

But maybe you want to extend your spam-fighting service to text messages. Still got you covered.

There is a wealth of data sets available for R and all you have to do is install a package. Ecdat is one of those packages, containing gobs of econometric data. How about an analysis of how math levels correlate with number of cigarettes smoked? I’d read that.

Ever wondered about how one person will be on the board of directors of several companies and it’s like, hey, maybe Condoleezza Rice with her ties to government surveillance isn’t the best choice for Dropbox? What if you could analyze those connections? Well, with this data set, you can. But only for Norway — it’s a network of the board members of public companies in Norway.

Ever seen a TV show where a government determines that someone is a terrorist based on their social ties? I always figured that data would be locked down tight somewhere, y’know, classified. But it turns out it isn’t. You, too, can analyze the social networks of terrorists.

There’s been a fair bit of controversy around all the bureaucracy of Wikipedia. But how does one become a bona fide Wikipedia big shot? Who’s the ideal Wikipedia administrator? Well, they’re voted for, and the data is available for download.

Harvard has opened up its set of “over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.”

GET-Evidence has put up public genomes for download. I think Steven Pinker’s data is in there somewhere. Maybe you could make yourself a clone?

Oh, and speaking of genomes, the 1000 Genomes project has made 260 terabytes of genome data downloadable.

The smallest data set on this list covers the survival rates of men and women on the Titanic. Female passengers were 4 times more likely to survive than male passengers.

Want a super specific breakdown of the contents of your food? Or of the metabolites in the human body? I’m not sure what you could do with it, but it might come in handy in some sort of dystopian future where humans are raised like cattle for their nutrients. (Maybe someone could use this to build a viral marketing campaign along the lines of, “How nutritious is your mom?”)

The Reference Energy Disaggregation Data Set has about 500 GB of compressed data on home energy use. Obvious use candidates: improving home efficiency or creating a visualization of just where people’s energy bills are going.

Invented a new image compression algorithm (Pied Piper, anyone?) and need data to test it on? Look no further than CSAIL’s tiny image data set.

Or maybe tiny images are too tiny. In that case, try the ImageNet database, which is structured around the WordNet hierarchy. So if you want to teach an algorithm what a narwhal looks like, this would be a good place to start. (This coming from someone whose sister thought narwhals were mythical until the age of 18.)

Still not enough? How about all the Wikipedia images?

Let’s say you’re building the next generation of book reader, and you want to automatically associate phrases with the relevant Wikipedia article. How? Stanford in association with Google Research has you covered with their English-phrase-to-associated-Wikipedia-article database. The research paper can be downloaded here.

Yandex, the Russian search engine, has made a bunch of search data available. Namely, if someone searches for something, what do they click on? Downsides: It’s a Russian search engine with Russian search results.

Just what kind of edits do people usually make on Wikipedia? I don’t know, but you can figure it out with this data set.

Did you know that Google has a search engine for data sets?

Pew Research has many free data sets, including their “Global Attitudes Project” archive. Questions this data could answer: Is the world becoming more progressive over time? How have attitudes towards religion shifted over time?

Speaking of public attitudes over time, you can download a set of the General Social Survey from 1972 until about 2012, which should answer both of those questions.

There’s a fun math problem called the celebrity problem, which asks you to find the person who everyone knows, but who knows nobody. But what about the real life celebrity problem? Try Yahoo’s collection of celebrity faces.
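In case you have not run into the celebrity problem before, here is a minimal sketch of the classic elimination approach. It assumes the “who knows whom” relation is given as a square matrix of booleans (my own choice of representation), narrows the field to one candidate in a single pass, and then verifies that candidate.

```python
def find_celebrity(knows):
    """Return the index of the person everyone knows but who knows nobody, or None.

    knows[i][j] is True when person i knows person j.
    """
    n = len(knows)
    if n == 0:
        return None

    # Elimination pass: if the candidate knows someone, the candidate is not the
    # celebrity; if not, that other person is not known by everyone, so they are out.
    candidate = 0
    for person in range(1, n):
        if knows[candidate][person]:
            candidate = person

    # Verification pass: the survivor must know nobody and be known by everybody.
    for other in range(n):
        if other == candidate:
            continue
        if knows[candidate][other] or not knows[other][candidate]:
            return None
    return candidate


# Person 2 is known by persons 0 and 1, and knows nobody.
party = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
print(find_celebrity(party))  # prints 2
```

The nice part is that this only asks a number of "does i know j?" questions that grows linearly with the number of people, rather than checking every pair.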

Need a billion webpages from February 2009? Maybe to train a never ending language learner named NELL? Yup, it’s available.

Did you know that you can download all the PDFs on Arxiv? Once we manage to teach machines natural language, we can just have a computer read it all and give us the cliff notes (and the scientific breakthroughs).

If you need economic census data on any industry, check out’s industry statistics portal. If finance is really evil, you ought to be able to find something damning in the data.

For those unfamiliar with Usenet, it’s sort of like a huge, text-only forum. It was much more popular before the rise of the world wide web. Anyways, you can download a huge data set of postings to Usenet here. It might be pretty good for some kind of textual analysis project or training a machine learning algorithm (maybe a spellchecker?) You could use the data to build out a Google Groups competitor, too.

Nick Bostrom has a very interesting paper called “Existential Risk Prevention as Global Priority.” The basic intuition is that preventing even small risks of human extinction is worthwhile if we consider all the human generations it would save. One way to start saving all those future lives might be by digging into this data set of every recorded meteor impact on Earth from 2500 BCE to 2012.

Speaking of mental health, if you’re interested in how it affects minorities specifically, try this.

There are a lot of lonely men and women out there, and some of those lonely men and women have excellent analytical skills. For those lonely people, I suggest using this data set, which “surveyed how Americans met their spouses and romantic partners, and compared traditional to non-traditional couples” to determine the best way to meet that special someone.

Tons of data on what is called “adolescent health” available here, but is actually more, including a bunch of relationship data and biomarkers. (Not creatine levels, unfortunately.)

Here’s a question: Are modern jobs worse than those of the past? My grandparents built tires at Firestone. Today, people rarely have that level of control and visceral experience of the finished product of their work. This set of five surveys regarding how different groups experience employment could answer that question. I can see the article now — “Is everything getting slightly worse? We found out.”

Stanford has 35 million Amazon reviews available for download. Lots of stuff you could do with this: use it to improve recommendation algorithms, or figure out whether or not there’s a follow-the-leader effect with reviews (i.e. do early positive reviews beget more positive reviews?).

Based on some of my research prior to writing this, the Google keyword “data sets on serial killers” is 1) really specific and 2) weirdly popular, but I guess there’s no accounting for taste. And, of course, we’ve got data for that, thanks to the Serial Killer Information Center.

In this gruesome vein, the University of Maryland has a “Global Terrorism Database,” which is a set of more than 113,000 terror incidents. You can download it after filling out a form. Ideas for use: visualization of terror incidents by location over time, predicting and preventing terror attacks, and creating early alert systems for vulnerable areas.

The MNIST Database is a classic in the field of machine learning. It’s a set of labeled hand-written digits, the kind of data OCR algorithms need. Today, some algorithms are actually more accurate than human judges! This would have been nice to have back when I was in grade school. I distinctly recall once arguing with a teacher over missing a question because she insisted that I had written the letter j when it was clearly a d. In the future, we’ll let the machines decide.

UCI has a poker hand data set available. My poker-fu is fairly weak, but I’m sure there’s some interesting analysis to be done there. I’ve heard second hand that humans still maintain some advantage over machines when it comes to poker, but I’m unable to verify that via Google. Machines have won in at least one tournament.

Another data set from UCI: images labeled as either advertisements or non-advertisements. This is good for building up classification algorithms that decide whether or not a new image is an ad or not, which might be good for, say, automatic ad blocking or spam detection. Or maybe a Google Glass application that filters out real life advertisements. That’d be cool. Look at a billboard and instead see a virtual extension of the natural landscape.

Remember the whole Star Wars Kid debacle? Wikipedia informs me that Attack of the Show rated it the number 1 viral video of all time. Andy Baio, one of the guys who was in on it before it was cool and coined the phrase “Star Wars Kid” has made his server logs from the time publicly available. Someone could take this data and produce a visualization of who saw it when via maps, along with annotations of where the traffic was coming from.

Who’s linking to who (and what) on WordPress? (Tidbit: most of the links to this site come from WordPress blogs.) With this WordPress crawl, you can find out. Visualizing the network might be sorta cool, but it’d be cooler still to uncover some information about “supernodes” that either are linked to often or put out a lot of links (or maybe both). Or maybe clustering people by interest.

Is Obama in bed with big oil? Or extremist environmentalists? Or the corn lobbies? And who was backing that Herman Cain dude, anyways? The 2012 Presidential Campaign Finance data is available for download. It would be neat to see an analysis of what industries prefer what candidates.

Cigarette data by state. Kentucky smokes the most, with West Virginia as a close second. Given the massive social harm of tobacco, a good analysis could very well save a lot of lives.

Want to build a Reddit recommendation engine? (Or, better yet, how about just a filter for the stupid-but-popular opinions?) Well, here’s the data a Redditor is using to do just that. The recommendation engine, I mean.

Global health data. This would be great for identifying high-impact ways to improve world health, like the Schistosomiasis Control Initiative, which is one of GiveWell’s top rated charities.

United States crime from 1960 to 2012. I’d like to see a graph of rape per capita over time (which, from a brief peek at the data, is dropping.) And then add the data for prison rape, which is morally repugnant but apparently a-okay to joke about on television.

Did you know that the best-selling item in Canadian grocery stores is Kraft Dinner (aka macaroni and cheese)? I wonder how it sells in Belgium or Taiwan. Here’s some supermarket data from there.

Data on usage of the Firefox web browser. Records things like number of tabs used, time active, number of private tabs opened. While that last point might allow for some titillating finds (private browsing is for porn!), it might be neat to see how accurate self-reports of time on internet compare to the actual data.

This one is super cool: Mozilla has put together a data set of the more than 200,000 bugs found in Mozilla and Eclipse. I would love to see a breakdown of what bugs are the most common and how they can be prevented. Software solutions would be worth a lot of money. Programming languages could be designed around them.

If you’re interested in the design of scheduling algorithms (I am!), Google has released a data set of the sort of jobs that they’re running on their clusters. Developing algorithms against this data set might help future proof your discoveries. After all, tomorrow’s desktop might look a lot like today’s data center.

Techcrunch released a data set with more than 400,000 company, investor, and entrepreneur profiles, along with an additional 45,000 investment rounds. This might be a good way to reverse engineer what the market’s looking for and what investors are funding.

Who receives H1-B visas? Might be interesting to know if some countries are more likely to get into the program or which companies “consume” the majority of the visas.

Here’s all the earthquakes between 1000 and 1903. Feeding them to a neural net and seeing what kind of predictions you get out might be neat. (And, hey, if you develop something better than the status quo, you can sell it and save lives!)

I’ve often wondered if the people who take personality tests online are more neurotic than the population at large. There’s a lot of data from a series of online personality tests available here, so you could compare their answers to those from the population at large, find out, and then send me an email.

And, finally, something I would have loved as a kid: the list to end all lists of naughty words.

Data Sets for Quantitative Research: Public Use Datasets

There are many research organizations making data available on the web, but still no perfect mechanism for searching the content of all these collections. The links below will take you to data search portals which seem to be among the best available. Note that these portals point to both free and pay sources for data, and to both raw data and processed statistics.

* Resources that are not entirely free are marked with an asterisk.

Machine Learning Datasets

These are the datasets that you will probably use while working on any data science or machine learning project:

Machine Learning Datasets for Data Science Beginners

1. Mall Customers Dataset

The Mall customers dataset contains information about people visiting the mall. The dataset has gender, customer id, age, annual income, and spending score. You can use it to gather insights from the data and group customers based on their behavior.

1.2 Data Science Project Idea: Segment the customers based on age, gender, and interest. Customer segmentation is the practice of dividing a customer base into groups of similar individuals. It is useful in customised marketing.
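One common way to do this kind of segmentation is k-means clustering. The sketch below is a minimal pure-Python version, not a reference implementation. The customer values are made up, and the choice of two features (annual income and spending score) and of the first k points as starting centroids are assumptions made only to keep the example short and repeatable.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (centroids, labels)."""
    # Assumption: start from the first k points so the result is repeatable.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Made-up (annual income in thousands, spending score 1-100) pairs.
customers = [(15, 80), (70, 20), (16, 77), (72, 15), (18, 85), (75, 25)]
centroids, labels = kmeans(customers, k=2)

for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

On real data you would usually scale the features first and try several values of k before settling on the number of segments.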

2. Iris Dataset

The iris dataset is a simple and beginner-friendly dataset that contains information about the flower petal and sepal sizes. The dataset has 3 classes with 50 instances in each class, therefore, it contains 150 rows with only 4 columns.

2.1 Data Link: Iris dataset

2.2 Data Science Project Idea: Implement a machine learning classification model on the dataset. Classification is the task of assigning each item to its corresponding class, here one of the three iris species.
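As a first classifier for this kind of data, a k-nearest-neighbours model is easy to write by hand. The sketch below uses only a handful of illustrative measurements in the style of the Iris data, not the full 150-row dataset, and the function name knn_predict is a hypothetical helper.

```python
import math
from collections import Counter

# (sepal length, sepal width, petal length, petal width) in cm.
# Illustrative rows only, standing in for the real dataset.
train = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def knn_predict(sample, training, k=1):
    """Return the majority label among the k closest training rows."""
    neighbours = sorted(training, key=lambda row: math.dist(sample, row[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((5.0, 3.4, 1.5, 0.2), train))
print(knn_predict((6.5, 3.0, 5.8, 2.2), train, k=3))
```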

3. MNIST Dataset

This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.

3.1 Data Link: MNIST dataset

3.2 Data Science Project Idea: Implement a machine learning classification algorithm on the images to recognize handwritten digits.

4. The Boston Housing Dataset

This is a popular dataset used in pattern recognition. It contains information about the different houses in Boston based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.

4.1 Data Link: Boston dataset

4.2 Data Science Project Idea: Predict the price of a new house using linear regression. Linear regression predicts the output for an unseen input when there is a roughly linear relationship between the input and output variables.
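To make the idea concrete, here is a minimal sketch of simple (one-feature) linear regression fitted with the closed-form least-squares formulas. The room counts and prices are made up and follow an exact line, so this only illustrates the mechanics. The real dataset has many more variables and noise.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: average number of rooms vs. price (in thousands).
rooms = [4, 5, 6, 7, 8]
prices = [14.0, 18.0, 22.0, 26.0, 30.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept
print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
print(f"predicted price for 6.5 rooms: {predicted:.2f}")
```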

5. Fake News Detection Dataset

It is a CSV file that has 7796 rows with 4 columns. The first column identifies the news item, the second holds the title, the third holds the news text, and the fourth is the label, REAL or FAKE.

5.2 Data Science Project Idea: Build a fake news detection model with the Passive Aggressive Classifier algorithm. The Passive Aggressive algorithm is an online learner, so it can classify massive streams of data and is quick to implement.
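The core of a Passive Aggressive classifier is small enough to sketch by hand. The version below is the basic update rule without a regularisation constant, applied to bag-of-words counts. The headlines are invented, labels are +1 for real and -1 for fake, and the helper names are hypothetical. A real project would more likely use a library implementation with TF-IDF features.

```python
from collections import Counter

def featurize(text):
    """Bag-of-words counts for a lower-cased, whitespace-split text."""
    return Counter(text.lower().split())

def dot(weights, features):
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())

def train_pa(samples, epochs=5):
    """samples: list of (text, label) with label +1 (real) or -1 (fake)."""
    weights = {}
    for _ in range(epochs):
        for text, y in samples:
            x = featurize(text)
            # Hinge loss: zero when the sample is classified with margin >= 1.
            loss = max(0.0, 1.0 - y * dot(weights, x))
            if loss > 0.0:
                # "Aggressive" step: smallest change that fixes this sample.
                tau = loss / sum(c * c for c in x.values())
                for tok, cnt in x.items():
                    weights[tok] = weights.get(tok, 0.0) + tau * y * cnt
            # Otherwise stay "passive" and leave the weights untouched.
    return weights

def predict(weights, text):
    return 1 if dot(weights, featurize(text)) >= 0 else -1

# Invented toy headlines.
samples = [
    ("senate passes budget bill after vote", 1),
    ("aliens endorse miracle cure shocking", -1),
    ("city council approves budget after debate", 1),
    ("shocking miracle diet doctors hate", -1),
]
weights = train_pa(samples)
print(predict(weights, "shocking miracle cure"))
print(predict(weights, "council passes budget"))
```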

6. Wine quality dataset

The dataset contains different chemical information about wine. It has 4898 instances with 14 variables each. The dataset is good for classification and regression tasks. The model can be used to predict wine quality.

6.2 Data Science Project Idea: Apply different machine learning algorithms such as regression, decision trees, and random forests, then compare the models and analyse their performance.

7. SOCR data – Heights and Weights Dataset

This is a simple dataset to start with. It contains only the height (inches) and weights (pounds) of 25,000 different humans of 18 years of age. This dataset can be used to build a model that can predict the heights or weights of a human.

7.2 Data Science Project Idea: Build a predictive model for determining height or weight of a person. Implement a linear regression model that will be used for predicting height or weight.

8. Parkinson Dataset

Parkinson’s disease is a nervous system disorder that affects movement. The dataset contains 195 records of people with 23 different attributes which contain biomedical measurements. The data is used to separate healthy people from people with Parkinson’s disease.

8.1 Data Link: Parkinson dataset

8.2 Data Science Project Idea: Build a model that differentiates healthy people from people with Parkinson’s disease. A useful algorithm for this purpose is XGBoost, which stands for extreme gradient boosting and is based on decision trees.

9. Titanic Dataset

On 15 April 1912, the “unsinkable” Titanic sank, killing 1502 of the 2224 people on board. The dataset contains information like name, age, sex, and number of siblings aboard for 891 passengers in the training set and 418 passengers in the testing set.

9.1 Data Link: Titanic dataset

9.2 Data Science Project Idea: Build a fun model to predict whether a person would have survived on the Titanic or not. Because the outcome is yes or no, this is a classification problem, so logistic regression is a better fit than linear regression.
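Since survival is a yes/no outcome, a logistic regression is a natural first model. The sketch below trains one with plain batch gradient descent. The eight rows are invented, and the two binary features (is_female, is_first_class) are assumptions chosen for illustration. They are not the actual columns of the dataset.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Estimated probability of survival for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the average log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Invented rows: (is_female, is_first_class) -> survived (1) or not (0).
rows = [(1, 1), (1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 1)]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = train_logistic(rows, labels)
print("P(survive | female, first class):", round(predict_proba(weights, bias, (1, 1)), 2))
print("P(survive | male, other class):  ", round(predict_proba(weights, bias, (0, 0)), 2))
```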

10. Uber Pickups Dataset

The dataset has information on about 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.

10.2 Data Science Project Idea: To analyze the data of the customer rides and visualize the data to find insights that can help improve business. Data analysis and visualization is an important part of data science. They are used to gather insights from the data and with visualization you can get quick information from the data.
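A simple first analysis is to count pickups per hour of the day. The sketch below does that with the standard library only. The column name "Date/Time", the timestamp format, and the sample rows are all assumptions made for illustration, so check them against the file you actually download.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up rows in an assumed layout. The real file may differ.
SAMPLE_CSV = """Date/Time,Lat,Lon,Base
4/1/2014 0:11:00,40.7690,-73.9549,B02512
4/1/2014 0:17:00,40.7267,-74.0345,B02512
4/1/2014 17:21:00,40.7316,-73.9873,B02512
4/1/2014 17:45:00,40.7588,-73.9776,B02512
4/1/2014 17:59:00,40.7594,-73.9722,B02512
4/2/2014 8:05:00,40.7383,-74.0403,B02512
"""

def pickups_per_hour(csv_file):
    """Count rows by the hour of day found in the Date/Time column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        when = datetime.strptime(row["Date/Time"], "%m/%d/%Y %H:%M:%S")
        counts[when.hour] += 1
    return counts

hourly = pickups_per_hour(io.StringIO(SAMPLE_CSV))
busiest_hour, busiest_count = hourly.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} pickups")
```

With the real files you would pass an open file handle instead of the in-memory sample, and a bar chart of the hourly counts makes a good first visualization.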

11. Chars74k Dataset

The dataset contains images of character symbols used in the English and Kannada languages. For English it has 62 classes (0-9, A-Z, a-z), with 7.7k characters from natural images, 3.4k hand-drawn characters, and 62k computer-synthesized characters from fonts.

11.1 Data Link: Chars 74k dataset

11.2 Data Science Project Idea: Implement character recognition in natural images. Character recognition is the process of automatically identifying characters in handwritten or printed text.

12. Credit Card Fraud Detection Dataset

The dataset contains credit card transactions that are labeled as fraudulent or genuine. Companies that run transaction systems can use it to build a model for detecting fraudulent activity.

12.2 Data Science Project Idea: Implement different algorithms like decision trees, logistic regression, and artificial neural networks to see which gives better accuracy. Compare the results of each algorithm and understand the behavior of models.

13. Chatbot Intents Dataset

The dataset is a JSON file that contains different tags like greetings, goodbye, hospital_search, pharmacy_search, etc. Each tag contains a list of patterns a user might type and the responses a chatbot can give for that pattern. The dataset is good for understanding how chatbot data works.

13.2 Data Science Project Idea: Tweak and expand the data with your observations to build and understand the working of a chatbot in organizations. A chatbot requires you to understand Natural language processing concepts.
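To get a feel for how such a file is used, here is a minimal sketch that loads intents from JSON and picks the tag whose patterns share the most words with the user's message. The JSON layout (an "intents" list with "tag", "patterns" and "responses" keys) and the sample entries are assumptions based on the description above, and word overlap is a stand-in for a trained NLP model.

```python
import json

# Hypothetical sample in the assumed layout.
INTENTS_JSON = """
{"intents": [
  {"tag": "greeting",
   "patterns": ["hello", "hi there", "good morning"],
   "responses": ["Hello! How can I help?"]},
  {"tag": "goodbye",
   "patterns": ["bye", "see you later"],
   "responses": ["Goodbye!"]},
  {"tag": "pharmacy_search",
   "patterns": ["find a pharmacy", "nearest pharmacy"],
   "responses": ["Searching for pharmacies nearby."]}
]}
"""

def tokenize(text):
    """Lower-case the text, drop ? and !, and return the set of words."""
    return set(text.lower().replace("?", " ").replace("!", " ").split())

def match_intent(message, intents):
    """Return the tag with the largest word overlap, or None if no overlap."""
    words = tokenize(message)
    best_tag, best_score = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            score = len(words & tokenize(pattern))
            if score > best_score:
                best_tag, best_score = intent["tag"], score
    return best_tag

intents = json.loads(INTENTS_JSON)["intents"]
print(match_intent("Where is the nearest pharmacy?", intents))
```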

Machine Learning Datasets for Natural Language Processing

1. Enron Email Dataset

This Enron dataset is popular in natural language processing. It contains around 0.5 million emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.

1.2 Machine Learning Project Idea: Use k-means clustering to build a model to detect fraudulent activities. K-means clustering is a popular unsupervised learning algorithm. It partitions the observations into k number of clusters by observing similar patterns in the data.

2. The Yelp Dataset

Yelp has made its dataset publicly available, but you have to fill in a form first to access the data. It contains 1.2 million tips by 1.6 million users, over 1.2 million business attributes and photos for natural language processing tasks.

2.1 Data Link: Yelp dataset

2.2 Machine Learning Project Idea: You can build a model which can detect whether a restaurant’s review is fake or real. With text processing and the additional features in the dataset you can build an SVM model that classifies reviews as fake or real.

3. Jeopardy Dataset

Jeopardy! is an American television game show in which general knowledge questions are asked with a twist. The dataset contains 200k+ questions and answers in a CSV or JSON file.

3.1 Data Link: Jeopardy dataset

3.2 Machine Learning Project Idea: Build a question answering system and implement it in a bot that can play Jeopardy! with users. The bot can be used on any platform like Telegram, Discord, Reddit, etc.

4. Recommender Systems Dataset

This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc that are used in building a recommender system.

4.2 Machine Learning Project Idea: Build a product recommendation system like Amazon’s. A recommendation system suggests products, movies, etc. based on your interests and the things you have liked and used earlier.
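One classic approach is user-based collaborative filtering: find users with similar ratings and suggest what they liked. The sketch below uses cosine similarity on a tiny, invented ratings table. The user and item names are placeholders, and real systems use far larger data and more refined models.

```python
import math

# Invented ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"item_a": 5, "item_b": 4},
    "ben": {"item_a": 5, "item_b": 5, "item_c": 4},
    "cy": {"item_a": 1, "item_d": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if dot else 0.0

def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ana", ratings))
```

Here "ana" rates items much like "ben" does, so the item only "ben" has rated comes out on top.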

5. UCI Spambase Dataset

Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4601 emails, each described by 57 attributes. You can build models to filter out the spam.

5.2 Machine Learning Project Idea: You can build a model that can identify your emails as spam or non-spam.

6. Flickr 30k Dataset

The Flickr 30k dataset is similar to the Flickr 8k dataset and it contains more labeled images. This has over 30,000 images and their captions. This dataset is used to build more accurate models than the Flickr 8k dataset.

6.2 Machine Learning Project Idea: Use the same model as for Flickr 8k and make it more accurate with more training data. A CNN model is great for extracting features from the image, and those features are then fed to a recurrent neural network that generates the caption.

7. IMDB reviews

The large movie review dataset consists of movie reviews from IMDB website with over 25,000 reviews for training and 25,000 for the testing set.

7.2 Machine Learning Project Idea: Perform sentiment analysis on the data to see what types of movies users like. Sentiment analysis is the process of analysing textual data and identifying whether the writer’s emotion is positive or negative.
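A multinomial Naive Bayes classifier is a common baseline for this kind of sentiment task. The sketch below implements it with add-one (Laplace) smoothing in plain Python. The four "reviews" are invented and much shorter than real ones, so treat this as a toy illustration of the technique.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns (word_counts, class_counts, vocab)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Pick the label with the highest log posterior."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented mini-reviews.
docs = [
    ("a wonderful moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull boring film", "neg"),
    ("terrible acting and a weak story", "neg"),
]
model = train_nb(docs)
print(classify("great moving story", model))
print(classify("dull weak film", model))
```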

8. MS COCO dataset

Microsoft’s COCO is a huge database for object detection, segmentation and image captioning tasks. It has around 330,000 images and about 1.5 million labeled object instances. The dataset is great for building production-ready models.

8.1 Data Link: MS COCO dataset

8.2 Machine Learning Project Idea: Detect objects from the image and then generate captions for them. LSTM (Long short term memory) network is responsible for generating sentences in English and CNN is used to extract features from image. To build a caption generator we have to combine these two models.

9. Flickr 8k Dataset

The Flickr 8k dataset contains 8000 images and each image is labeled with 5 different captions. The dataset is used to build an image caption generator.

9.1 Data Link: Flickr 8k dataset

9.2 Machine Learning Project Idea: Build an image caption generator using a CNN-RNN model. An image caption generator analyses the features of an image and generates an English sentence that describes it.

Machine Learning Datasets for Computer Vision and Image Processing

1. CIFAR-10 and CIFAR-100 dataset

These are two datasets. The CIFAR-10 dataset contains 60,000 tiny images of 32×32 pixels. The images are labeled from 0 to 9, and each number represents a class. CIFAR-100 is similar to CIFAR-10, but it has 100 classes instead of 10. Both are good for implementing image classification.

1.1 Data Link: CIFAR dataset

1.2 Artificial Intelligence Project Idea: Perform image classification on different objects and build a model. In image classification, we take an image as input and the goal is to decide which category the image belongs to.

2. GTSRB (German traffic sign recognition benchmark) Dataset

The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes and contains information on the bounding box of each sign. The dataset is used for multiclass classification.

2.1 Data Link: GTSRB dataset

2.2 Artificial Intelligence Project Idea: Build a model using a deep learning framework that classifies traffic signs and also recognises the bounding box of each sign. Traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate action.

3. ImageNet dataset

ImageNet is a large image database that is organized according to the WordNet hierarchy. It has over 100,000 phrases and an average of 1000 images per phrase. The size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts a challenging competition named ILSVRC in which people build more and more accurate models.

3.1 Data Link: Imagenet Dataset

3.2 Artificial Intelligence Project Idea: Implement image classification on this huge database and recognise objects. Convolutional neural network (CNN) models are the usual choice for getting accurate results on this project.

4. Breast Histopathology Images Dataset

This dataset contains 277,524 image patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 patches that are negative for IDC (invasive ductal carcinoma) and 78,786 that are positive.

4.2 Artificial Intelligence Project Idea: Build a model that can classify breast cancer images. You can build the image classification model with convolutional neural networks.

5. Cityscapes Dataset

This is an open-source dataset for Computer Vision projects. It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene.

5.2 Artificial Intelligence Project Idea: To perform image segmentation and detect different objects from a video on the road. Image segmentation is the process of digitally partitioning an image into various different categories like cars, buses, people, trees, roads, etc.

6. Kinetics Dataset

There are three different datasets for Kinetics: Kinetics 400, Kinetics 600 and Kinetics 700. This is a large-scale dataset of URL links to high-quality video clips, with around 650,000 clips in the largest version.

6.1 Data Link: Kinetics dataset

6.2 Artificial Intelligence Project Idea: Build a human action recognition model and detect the action of a human. An action is recognized from a series of observations, that is, from several video frames rather than a single image.

7. MPII human pose dataset

The MPII human pose dataset contains 25,000 images with over 40,000 people with annotated body joints. The overall dataset covers over 410 human activities. The dataset is 12.9 GB in size.

7.2 Artificial Intelligence Project Idea: To detect different human poses based on the alignment of a person’s body joints. Human pose detection tracks every movement of the body. It is also known as the localization of human joints.

8. 20BN-something-something dataset v2

This is a huge dataset of 220,847 high-quality video clips that show humans performing actions like picking something up, putting something down, opening something, closing something, etc.

8.2 Artificial Intelligence Project Idea: To implement a human action recognition model and detect different activities performed by a human. The activities can be used in detecting activities while driving, surveillance activities, etc.

9. Object 365 Dataset

The object 365 dataset is a large collection of high-quality images with bounding boxes of objects. It has 365 objects, 600k images, and 10 million bounding boxes. This is good for making object detection models.

9.1 Data Link: Object 365 dataset

9.2 Artificial Intelligence Project Idea: Classify images captured from the camera and detect objects present in the image. Object detection deals with recognizing which object is present in the image along with the coordinates of the object.
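Object detection models are usually scored by how well each predicted bounding box overlaps the annotated one, measured as intersection over union (IoU). The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) corner coordinates, which is one common convention. Check the annotation format of the dataset you use.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 if it is degenerate."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    inter = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter_area = box_area(inter)
    union_area = box_area(box_a) + box_area(box_b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

predicted = (0, 0, 2, 2)
annotated = (1, 1, 3, 3)
print(f"IoU = {iou(predicted, annotated):.3f}")
```

A prediction is often counted as correct when its IoU with the annotated box is at least 0.5, although the threshold varies by benchmark.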

10. Photo sketching dataset

The dataset contains images paired with their contour drawings. It has 1000 outdoor images, and each image has 5 rough contour drawings that represent its outline.

10.2 Artificial Intelligence Project Idea: Build a model that can develop sketches automatically from the images. This will take an image as an input and generate a sketch image using computer vision techniques.

11. CQ500 Dataset

This publicly available dataset has 491 head CT scans with 193,317 slices. It contains the opinions of three different radiologists on each scan. The dataset can be used to build models that can detect bleeding, fractures and mass effect in the head.

11.1 Data Link: CQ 500 dataset

11.2 Artificial Intelligence Project Idea: Make a model for hospitals that can automatically generate a report of a fracture, bleeding or other things by analyzing the CT scan dataset.

12. IMDB-Wiki dataset

The IMDB-Wiki dataset is one of the largest open-source datasets of face images with labeled gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.

12.1 Data Link: IMDB wiki dataset

12.2 Artificial Intelligence Project Idea: Make a model that will detect faces and predict their gender and age. You can have categories in different ranges like 0-10, 10-20, 30-40, 50-60, etc.

13. Color Detection Dataset

The dataset contains a CSV file that has 865 color names with their corresponding RGB(red, green and blue) values of the color. It also has the hexadecimal value of the color.

13.2 Artificial Intelligence Project Idea: The color dataset can be used to make a color detection app with an interface for picking a color from an image; the app then displays the name of that color.
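The heart of such an app is a nearest-color lookup: given the RGB value of the picked pixel, find the named color closest to it. The sketch below uses a tiny hand-written table as a stand-in for the CSV file, and squared Euclidean distance in RGB space as a simple (not perceptually exact) measure of closeness.

```python
# Tiny stand-in for the color names CSV: name -> (red, green, blue).
COLORS = {
    "Black": (0, 0, 0),
    "White": (255, 255, 255),
    "Red": (255, 0, 0),
    "Lime": (0, 255, 0),
    "Blue": (0, 0, 255),
    "Yellow": (255, 255, 0),
}

def nearest_color(rgb, colors=COLORS):
    """Return the name whose RGB value is closest to the given pixel."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, colors[name]))
    return min(colors, key=squared_distance)

def to_hex(rgb):
    """Format an (r, g, b) tuple as a hexadecimal color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)

pixel = (250, 10, 10)
print(nearest_color(pixel), to_hex(pixel))
```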

Machine Learning Datasets for Deep Learning

1. Youtube 8M Dataset

The YouTube 8M dataset is a large-scale labeled video dataset that has 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3862 classes and an average of 3 labels per video. It is used for video classification purposes.

1.1 Data Link: Youtube 8M

1.2 Machine Learning Project Idea: Use the dataset for video classification, so that the model can describe what a video is about. A video classifier takes a series of frames as input and decides which category the video belongs to.

2. Urban Sound 8K dataset

The urban sound dataset contains 8732 urban sounds from 10 classes like an air conditioner, dog bark, drilling, siren, street music, etc. The dataset is popular for urban sound classification problems.

2.2 Machine Learning Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.

3. LSUN Dataset

Large-scale Scene Understanding (LSUN) is a dataset of millions of color images of scenes and objects. It is much bigger than the ImageNet dataset. There are around 59 million images, 10 different scene categories, and 20 different object categories.

3.1 Data Link: LSUN dataset

3.2 Machine Learning Project Idea: Build a model to detect what scene is in the image. For example – a classroom, bridge, bedroom, church_outdoor, etc. The goal of scene understanding is to gather as much knowledge of a given scene image as possible. It includes categorization, object detection, and object segmentation.

4. RAVDESS Dataset

RAVDESS stands for the Ryerson Audio-Visual Database of Emotional Speech and Song. It contains audio files of 24 actors (12 male, 12 female) expressing different emotions like calm, angry, sad, happy, fearful, etc. Each expression is recorded at two intensity levels, normal and strong. The dataset is useful for speech emotion recognition.

4.1 Data Link: RAVDESS dataset

4.2 Machine Learning Project Idea: Build a speech emotion recognition classifier to detect the emotion of the speaker. The audio clips of people are classified into emotions like anger, happy, sad, etc.

5. Librispeech Dataset

This dataset contains a large amount of English speech derived from the LibriVox project. It has 1000 hours of read English speech in various accents. It is used for speech recognition projects.

5.2 Machine Learning Project Idea: Build a speech recognition model to detect what is being said and convert it into text. The objective of speech recognition is to automatically identify what is being said in audio.

6. Baidu Apolloscape Dataset

The dataset is designed to promote the development of self-driving technologies. It contains high-resolution color videos with hundreds of thousands of frames and their pixel annotations, stereo image, dense point cloud, etc. The dataset has 25 different semantic items like cars, pedestrians, cycles, street lights, etc.

6.2 Machine Learning Project Idea: Build a self-driving robot that can identify different objects on the road and take action accordingly. The model can segment the objects in the image, which helps the robot avoid collisions and plan its own path.

Machine Learning Datasets for Finance and Economics

1. quandl Data Portal

Quandl is a vast repository of economic and financial data. Some of the datasets are free, while others need to be purchased. The quantity and quality of the data make this platform a strong choice for finding datasets for production-ready models.

1.1 Data Link: quandl datasets

2. The World Bank Open Data Portal

The World Bank is a global development organization that offers loans to developing countries. It holds a huge amount of data on all its programs, and that data is publicly available. It has many missing values, so it gives you experience with real-world data.

3. IMF Data Portal

The IMF (International Monetary Fund) publishes data on international finances, debt rates, investments, foreign exchange reserves, and commodities.

3.1 Data Link: IMF datasets

4. American Economic Association (AEA) Data Portal

The American Economic Association has a wealth of data available online and is a great resource for finding US macroeconomic data.

4.1 Data Link: AEA datasets

5. Google Trends Data Portal

Google trends data can be used to examine and analyze the data visually. You can also download the dataset into CSV files with a simple click. We can find out what’s trending and what people are searching for.

6. Financial Times Market Data Portal

The Financial Times market data is a good resource for up-to-date information on financial markets from all over the world. You can find stock price indexes, commodities, and foreign exchange rates.

Machine Learning Datasets for Public Government

1. Portal

This site is the home of the US government’s open data. You can find data on various domains like agriculture, health, climate, education, energy, finance, science, and research. Many software applications use the website to collect data and build consumer products.

1.1 Data Link: datasets

2. Data Portal: Open government data (India)

The open government data platform gives us access to government-owned shareable data. It is part of the Digital India initiative and is built on an open-source stack. It publishes many datasets, tools, APIs, etc.

3. Food environment Atlas Data Portal

The platform contains data on food in the US and on how the local food environment affects people’s diets. It includes research on food choices and diet quality, which helps in determining how accessible healthy food choices are.

4. Health Data Portal

This is a portal of the US Department of Health and Human Services. It hosts more than 3,000 valuable datasets and also offers an API.

4.1 Data Link: Health datasets

5. Centers for Disease Control and Prevention Data Portal

The CDC has a wide variety of health-related datasets on topics such as diabetes, cancer, and obesity. More resources with data on diseases are also available there.

6. London Datastore Portal

This contains data about the lives of people in London, for example how much the population has grown in 5 years or the number of tourists visiting London. It has over 700 datasets that offer insights into the city.

7. Canada Government Open Data Portal

This is a portal to the data related to Canadians. You can find datasets related to subjects like agriculture, art, music, education, government, health, etc.

How to process data annotation?

Step 1: Data Collection

Data collection is the process of gathering and measuring information from countless different sources. To use the data we collect to develop practical artificial intelligence (AI) and machine learning solutions, it must be collected and stored in a way that makes sense for the business problem at hand.

There are several ways to find data. For classification tasks, you can turn class names into keywords and crawl the Internet for matching images. You can also collect photos and videos from social networking sites, satellite images from Google, or freely available data from public cameras or cars (Waymo, Tesla). You can even buy data from third parties, but pay attention to its accuracy. Some common datasets are available for free, such as Common Objects in Context (COCO), ImageNet, and Google’s Open Images.

Some common data types are Image, Video, Text, Audio, and 3D sensor data.

  • Image: Often photographs of people, objects, or animals.
  • Video: Recorded footage from CCTV or cameras, usually divided into scenes.
  • Text: Documents of different types that contain numbers and words and can be in multiple languages.
  • Audio: Sound recordings from people of different demographics.
  • 3D Sensor data: 3D models generated by sensor devices.

Step 2: Identify the problem

Knowing what problem you are dealing with will help you to decide the techniques you should use with the input data. In computer vision, there are some tasks such as:

  • Image classification: Collect and classify the input data by assigning a class label to an image.
  • Object detection & localization: Detect and locate the presence of objects in an image and indicate their location with a bounding box, point, line, or polyline.
  • Object instance / semantic segmentation: In semantic segmentation, you have to label each pixel with a class of objects (Car, Person, Dog, etc.) and non-objects (Water, Sky, Road, etc.). Polygon and masking tools can be used for object semantic segmentation.
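To make the difference between these tasks more concrete, here is a minimal sketch of what a label record could look like for each one. The class and field names (`ClassificationLabel`, `BoundingBox`, `DetectionLabel`, `SegmentationLabel`) and the file names are hypothetical and only meant for illustration; real annotation tools use their own formats.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationLabel:
    """Image classification: one class label per image."""
    image_path: str
    class_name: str


@dataclass
class BoundingBox:
    """One detected object, located by a box in pixel coordinates."""
    class_name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class DetectionLabel:
    """Object detection: any number of boxes per image."""
    image_path: str
    boxes: List[BoundingBox] = field(default_factory=list)


@dataclass
class SegmentationLabel:
    """Semantic segmentation: mask[row][col] is a class index for every pixel."""
    image_path: str
    mask: List[List[int]]
    class_names: List[str]


# Example records for the same (hypothetical) image.
cls_label = ClassificationLabel("street_001.jpg", "Car")
det_label = DetectionLabel("street_001.jpg", [BoundingBox("Car", 10, 20, 110, 80)])
seg_label = SegmentationLabel(
    "street_001.jpg",
    mask=[[0, 0, 1], [0, 1, 1]],      # 0 = Road, 1 = Car
    class_names=["Road", "Car"],
)
print(det_label.boxes[0].area())
```

The amount of labeling work grows from one label per image, to one box per object, to one label per pixel, which is why the task should be fixed before annotation starts.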

Step 3: Data Annotation

After identifying the problem, you can label the data accordingly. For a classification task, the labels are the keywords used while finding and crawling the data. For an instance segmentation task, there should be a label for each pixel of the image. Once you have the labels, you need tools to perform image annotation (i.e. to set labels and metadata for images). Popular tools include Comma Coloring, Annotorious, and LabelMe. You can refer to some of the common data annotation tools and their features in our infographic.

However, this approach is manual and time-consuming. A faster alternative is to use algorithms like Polygon-RNN++ or Deep Extreme Cut. Polygon-RNN++ takes the object in the image as input and outputs polygon points surrounding the object to create segments, which makes labeling more convenient. The working principle of Deep Extreme Cut is similar, but it takes four extreme points of the object (leftmost, rightmost, top, and bottom) as input.
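Whichever tool produces the polygon, it still has to be turned into per-pixel labels for segmentation. The sketch below shows one way to do that with NumPy using the even-odd (ray casting) rule on pixel centres. The function name `polygon_to_mask` is hypothetical, and this is not how Polygon-RNN++ or Deep Extreme Cut work internally; it only illustrates the polygon-to-mask step.

```python
import numpy as np


def polygon_to_mask(polygon, height, width):
    """Rasterize polygon vertices [(x, y), ...] into a boolean mask.

    A pixel is marked True when its centre lies inside the polygon
    according to the even-odd rule.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel centre
    py = ys + 0.5  # y coordinate of each pixel centre
    inside = np.zeros((height, width), dtype=bool)

    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # Rows whose centre lies between the two edge endpoints.
        crosses = (y1 > py) != (y2 > py)
        # x position where the edge meets each row's horizontal ray.
        x_at_y = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        # Toggle pixels that lie to the left of the crossing point.
        inside ^= crosses & (px < x_at_y)
    return inside


# A square with corners (1, 1) and (4, 4) on a 6 x 6 image.
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = polygon_to_mask(square, height=6, width=6)
print(mask.astype(int))
```

In practice you would usually rely on the rasterization built into your annotation tool or image library; the sketch is only meant to show what "a label for each pixel" means once you have a polygon.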

It is also possible to use the “Transfer Learning” method to label data, by using models pre-trained on large-scale datasets such as ImageNet or Open Images. Since the pre-trained models have learned many features from millions of different images, their accuracy is fairly high. Based on these models, you can find and label each object in the image. Note that the data these models were pre-trained on should be similar to the collected dataset for feature extraction or fine-tuning to work well.
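One common way to use a pre-trained model for labeling is to accept only its confident predictions and send the rest to a human annotator. Here is a minimal, model-agnostic sketch of that idea. The function name `pseudo_label`, the 0.9 threshold, and the example probabilities are assumptions for illustration; in a real setup the probabilities would come from your own pre-trained model.

```python
import numpy as np


def pseudo_label(probabilities, class_names, threshold=0.9):
    """Assign a class name when the model is confident, else None.

    probabilities: array of shape (n_images, n_classes), one row of
    class probabilities per image, as produced by a pre-trained model.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    best = probabilities.argmax(axis=1)
    confidence = probabilities.max(axis=1)
    labels = []
    for idx, conf in zip(best, confidence):
        # Low-confidence images get None and go to manual review.
        labels.append(class_names[idx] if conf >= threshold else None)
    return labels


# Made-up model outputs for three images and two classes.
probs = [[0.97, 0.03], [0.55, 0.45], [0.05, 0.95]]
labels = pseudo_label(probs, ["cat", "dog"], threshold=0.9)
print(labels)  # ['cat', None, 'dog']
```

The threshold trades labeling speed against label noise: a lower value labels more images automatically but lets more mistakes through.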

The benefits of utilizing metadata graphs and machine learning

According to the Forrester Insights-Driven Business report, businesses are drowning in data but starving for insights. For trustworthy insights, the data driving the analysis must be trustworthy and analysts must be able to find the best data for their purpose. You can ensure better understanding of and access to trusted data through extensive metadata associated with each data asset, providing important and accurate business context.

Machine learning powers the building of active metadata graphs by intelligently automating data classification, cataloging, lineage and policy management to add rich context to data assets at scale.

Active metadata graphs help you:

  • Generate greater visibility into the data landscape: With ML-powered automatic data classification and auto-linking of data sets, business terms, policies, processes and more, data curators and data consumers can collaborate on business semantics for trusted data.
  • Evaluate the right data for your needs: Data profiling, data scoring, and crowdsourced ratings and reviews strengthen data context and allow business analysts and data scientists to evaluate and choose the best data for their purposes.
  • Enhance data shopping experience: With highly relevant and rich business context around data, users have a more intuitive and simplified data shopping experience, which provides the right data with the right context to the right users.
  • Deliver faster insights: Automated discovery, understanding, and collaborative data access for business analysts and data scientists reduce time to insights.

Step 3: Data Cleaning

This is the most important step of any machine learning or data science project, and it accounts for about 80% of the overall work. For this project I cleaned the data manually by identifying the relationships between multiple columns. Tools and standard procedures are available, but I found the manual approach more suitable in terms of accuracy.

The modules (functions) used to clean the features of the data set are as follows:

  1. Product category feature: The ProductCatagory2() module mainly deals with the product category feature and removes all the NaN values based on the conditions specified above.

  2. Gender feature: Similarly, the Gender() module deals with the gender feature, and userGroupId() deals with cleaning the group id data.

  3. Date Time feature: Since the Date Time column contains all the time parameters (year, month, date, hour, and minute), they must be split into separate columns to make the data set more useful.
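The Date Time split in step 3 could look roughly like this in pandas. This is only a sketch: the column name `DateTime` and the sample values are assumptions, since the original data set is not shown here.

```python
import pandas as pd

# Hypothetical sample; the real column name and format may differ.
df = pd.DataFrame({"DateTime": ["2017-01-02 13:45", "2017-03-15 08:05"]})

# Parse the strings once, then pull each component into its own column.
parsed = pd.to_datetime(df["DateTime"])
df["year"] = parsed.dt.year
df["month"] = parsed.dt.month
df["day"] = parsed.dt.day
df["hour"] = parsed.dt.hour
df["minute"] = parsed.dt.minute

# The original combined column is no longer needed as a feature.
df = df.drop(columns="DateTime")
print(df)
```

Splitting the timestamp lets a model pick up patterns such as hour-of-day or month-of-year effects that are hidden inside a single date string.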

Data cleaning is now done, and all the NaN values have been replaced with meaningful values. It is time to split the data set into two parts: a training part and a testing part.

Don’t confuse this with the separate training and testing files. At this point we will use only the training file for the train test split method.
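A minimal sketch of such a split, assuming scikit-learn's `train_test_split` is the method meant here. The DataFrame, its column names (`feature_a`, `feature_b`, `target`), and the 80/20 ratio are made up for illustration; in the project they would come from the cleaned training file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned training file.
train_df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": [0, 1] * 5,
})

X = train_df.drop(columns="target")  # input features
y = train_df["target"]               # value to predict

# Hold out 20% of the rows for testing; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

Holding out part of the training file gives you labeled data to check the model against before you ever touch the separate testing file.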


Next, let’s go ahead and visualize our findings. We want output that non-technical stakeholders can easily consume. We can do this using pandas DataFrames and PyPlot.

For both we will need the following libraries:

As you can see I have chosen to work with 2 separate DataFrames from two lists, Polarity and Subjectivity.

Alternatively, you can generate a single histogram with both sets of data. If you do, I suggest adjusting the transparency so that the overlapping bars stay visible.
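A combined histogram along those lines might look like the sketch below. The values in the two lists are made-up sample scores, and the names `polarity_df`, `subjectivity_df`, and the output file name are assumptions for illustration; the `alpha` argument controls the transparency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample scores standing in for the real results.
polarity = [0.1, -0.3, 0.5, 0.0, 0.8, -0.6]
subjectivity = [0.4, 0.9, 0.2, 0.5, 0.7, 0.3]

# Two separate DataFrames, one per list.
polarity_df = pd.DataFrame(polarity, columns=["Polarity"])
subjectivity_df = pd.DataFrame(subjectivity, columns=["Subjectivity"])

# Draw both histograms on the same axes; alpha < 1 keeps overlaps visible.
fig, ax = plt.subplots()
ax.hist(polarity_df["Polarity"], bins=10, alpha=0.5, label="Polarity")
ax.hist(subjectivity_df["Subjectivity"], bins=10, alpha=0.5, label="Subjectivity")
ax.set_xlabel("Score")
ax.set_ylabel("Count")
ax.legend()
fig.savefig("sentiment_histogram.png")
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.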