Data Sources

This page is no longer supported; please see the new website.

Finding Data on the Internet: http://www.inside-r.org/howto/finding-data-internet

R packages for connecting to APIs:

  • coinmarketcapr: Connects to CoinMarketCap to retrieve cryptocurrency market caps and prices (https://github.com/amrrs/coinmarketcapr).
  • rtweet: R client for accessing Twitter's stream and REST APIs (http://rtweet.info and https://github.com/mkearney/rtweet).
  • epidata: R package that links to the API at http://www.epi.org/. The Economic Policy Institute provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force. The data are compiled from Economic Policy Institute analysis of government data sources. Use them to research wages, inequality, and other economic indicators over time and among demographic groups. Data are usually updated monthly.
  • acs: R package that links to the API at https://www.census.gov/data/developers/data-sets.html. Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial short-form), SF3 (Decennial long-form), and the American Community Survey (ACS).
  • Rfacebook: Access to the Facebook API from R (https://github.com/pablobarbera/Rfacebook).
  • tidyhydat: Canadian hydrometric data. Historical data are contained within HYDAT, the Canadian national Water Data Archive, which is published quarterly by the Government of Canada’s Department of Environment and Climate Change. Data in this archive range from 1850 to 2017. tidyhydat also provides functions to access real-time data over the web. This package is of interest to anyone who needs Canadian hydrometric data in R.
  • ipumsr: An easy way to import census, survey, and geographic data provided by ‘IPUMS’ into R, plus tools that use the associated metadata to make analysis easier. ‘IPUMS’ data describing 1.4 billion individuals drawn from over 750 censuses and surveys is available free of charge from the IPUMS website <https://ipums.org>.
  • ess: Download data from the European Social Survey directly from its website <http://www.europeansocialsurvey.org/>. Two families of functions let you download the data and interactively check all available countries and rounds.
  • data360r: Makes it easy to work with the Application Program Interface (API) of the TCdata360 and Govdata360 platforms at <https://tcdata360.worldbank.org/> and <https://govdata360.worldbank.org/>, respectively. These APIs provide access to data, metadata, and related information for over 5000 trade, competitiveness, and governance indicators from sources both inside and outside the World Bank Group. Package functions simplify downloading data sets, metadata, and related information, and support searching based on a user-supplied query.
  • Lahman: Provides the tables from the ‘Sean Lahman Baseball Database’ as a set of R data frames. It includes the data on pitching, hitting, and fielding performance and other tables from 1871 through 2015, as recorded in the 2016 version of the database.
  • wbstats: Links to World Bank data; see the Women in Parliament dataset: https://github.com/saghirb/Women-in-Parliament-Hex-Sticker

R packages containing multiple datasets:

Dynamic Data Sets / Databases:

  • College Scorecard (https://collegescorecard.ed.gov/): A tremendous amount of information about all universities (though some of it is collected only from students on financial aid).
  • National Park Service Visitor Use Statistics (https://irma.nps.gov/Stats/)
  • Financial and Economic data (https://www.quandl.com/)
  • Behavioral Risk Factor Surveillance System: http://www.cdc.gov/brfss/
  • General Social Survey (http://www3.norc.org/GSS+Website/)
  • National Health and Nutrition Examination Survey from the CDC: http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm
  • Medicare dataset (discussed on whitehouse.gov)
  • State Health Facts: http://www.statehealthfacts.org/
  • gapminder.org – a fascinating website with amazing graphics (social and economic data broken down by country). Click on the spreadsheet links to download the data.
  • Wolfram|Alpha (http://www.wolframalpha.com/): This is billed as a computational search engine. Enter “nachos” and you get a detailed nutritional analysis; enter “GDP of Albania” and you get several forms of GDP, a historical graph, and other economic variables; enter your favorite college and you get lots of information (including the number of degrees in mathematics in 2009, its location on a map, and a link to a satellite view of campus). While the case-by-case data display is not convenient for building datasets, there are pretty good links to the sources that Wolfram|Alpha pulls data from. For example, the Wolfram|Alpha page for a college or university has a data source link at the bottom to the National Center for Education Statistics website, where you can download your own custom data files from IPEDS (the Integrated Postsecondary Education Data System), for instance to find the average faculty salary by rank for all the schools in your comparison group. Similarly, the nacho search gives a link to the USDA’s National Nutrient Database, and a few clicks later I had a spreadsheet with data on 50+ nutrients in 7400+ foods (and that’s the abbreviated data!).
  • The Census Bureau
  • Baby names (popularity by year and state), compiled by the Social Security Administration

 

New & Continuously Revised Static Data Sets / Databases:

Journals / Journal articles that provide corresponding data:

Static Data Sets / Databases:

Websites for Visualizing Data:

  • Information is Beautiful (http://www.informationisbeautiful.net/)
  • Many Eyes (http://manyeyes.alphaworks.ibm.com/manyeyes/): This is billed as a wiki for data and visualizations of data. Users can contribute datasets and graphics as well as comment on what others have contributed. Some of the “visualizations” are pretty bizarre; others are interesting. For example, I’m not sure where else I could find different datasets (e.g. current average home rental prices) for counties in Ireland, display the data by shading a map of Ireland with the variable I choose, and have a link to the report where the data appear. A search with the keyword “golf” produced 14 hits, including several that referred to the Volkswagen Golf, a couple where individual golfers posted datasets with their own scores (with quite detailed information for each round), listings of the length and price of golf courses in the Toronto area, the World Golf Rankings top 250 golfers (from 2007), and data on PGA Tour golfers (from ESPN) for the 2007 season.
  • From Mark Ward at Purdue: http://llc.stat.purdue.edu/2014/29000/visualsites.html
  • Nathan Yau’s amazing visualizations: http://flowingdata.com/ (most include the corresponding datasets)