Other Data Sources
The web is a vast source of datasets on almost any subject, including demographics, disease, economics, finance, geography, entertainment, and science. A good starting point is Google’s Dataset Search, which indexes thousands of public datasets.
Here are some more suggestions:
- Kaggle: Kaggle hosts machine learning competitions and contains a large number of datasets that are generally free and open to the public.
- Awesome Public Datasets: Collection of public datasets, arranged by area
- UK data and UK Office for National Statistics
- U.S. Government’s open data with many datasets on a range of issues
- Data Is Plural: a weekly newsletter that has collected over a thousand useful or curious datasets. This may well be one of my favourite dataset collections!
- Our World in Data contains time series of demographic and global development data. Their collection of Covid-19 data is among the best.
- TidyTuesday: A weekly data project in R that releases a new dataset every week. The emphasis is on understanding how to summarise and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem.
- fivethirtyeight.com is a data-driven journalism site that shares the data behind most of its stories
- In terms of investigative journalism, The Markup and ProPublica are both data-driven and share their data. All Markup data is freely available, and ProPublica provides many of its datasets for free
- Erik Gahner’s list of political science datasets: Datasets divided by topic (governance, elections, policy, political elites, etc.), geography (country, region), etc.
- BigQuery public datasets: Google has set up BigQuery, a data warehouse that hosts some large datasets that you really need to access with SQL. There is even an R package, bigrquery, that allows you to easily talk with BigQuery’s database.