Data Sources

Finding Data on the Internet: http://www.inside-r.org/howto/finding-data-internet

 

Dynamic Data Sets / Databases:

  • College Scorecard (https://collegescorecard.ed.gov/). A tremendous amount of information about all universities (though some of it is collected only from students on financial aid).
  • Financial and economic data (https://www.quandl.com/)
  • Behavioral Risk Factor Surveillance System: http://www.cdc.gov/brfss/
  • General Social Survey (http://www3.norc.org/GSS+Website/)
  • National Health and Nutrition Examination Survey from the CDC: http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm
  • Medicare dataset (discussed on whitehouse.gov)
  • State Health Facts: http://www.statehealthfacts.org/
  • gapminder.org – a fascinating website with amazing graphics (social and economic data broken down by country). Click on the spreadsheet links to download the data.
  • Wolfram/Alpha (http://www.wolframalpha.com/) – This is billed as a computational search engine. Put in “nachos” and you get a detailed nutritional analysis; put in “GDP of Albania” and you get several forms of GDP, a historical graph, and other economic variables; put in your favorite college and you get lots of info (including the number of degrees in mathematics in 2009, its location on a map, and a link to a satellite view of campus). While the case-by-case data display is not so convenient for building datasets, there are pretty good links to the sources that Wolfram is pulling data from. For example, the Wolfram/Alpha page of info on a college or university has a data source link at the bottom to the National Center for Education Statistics website, where you can download your own custom data files from IPEDS (the Integrated Postsecondary Education Data System). Want to know the average faculty salary by rank for all the schools in your comparison group? Similarly, the nacho search gives a link to the USDA’s National Nutrient Database, and a few clicks later I’ve got a spreadsheet with data on 50+ nutrients in 7400+ foods (and that’s the abbreviated data!).
  • The Census Bureau
  • Baby names (popularity by year and state), compiled by the Social Security Administration
  • epidata: R package to link to the API at http://www.epi.org/.  The Economic Policy Institute provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force. It is compiled from Economic Policy Institute analysis of government data sources. Use it to research wages, inequality, and other economic indicators over time and among demographic groups. Data is usually updated monthly.
  • acs: R package to link to the API at https://www.census.gov/data/developers/data-sets.html.  Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial short-form), SF3 (Decennial long-form), and the American Community Survey (ACS). 
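For readers working outside R, the Census API mentioned in the last item can also be queried over plain HTTP. The following is a minimal Python sketch, not part of the acs package. It assumes the commonly used endpoint pattern https://api.census.gov/data/<year>/<dataset> with "get" and "for" parameters, and assumes the response is a JSON array whose first row is the header. The helper names (build_query_url, rows_to_records) and the sample payload are made up for illustration, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the Census Data API.
BASE_URL = "https://api.census.gov/data"


def build_query_url(year, dataset, variables, geography, api_key=None):
    """Build a query URL (hypothetical helper; it does not send a request)."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return f"{BASE_URL}/{year}/{dataset}?{urlencode(params, safe=',:*')}"


def rows_to_records(payload):
    """Turn a header-row JSON array into a list of dicts (hypothetical helper)."""
    header, *rows = json.loads(payload)
    return [dict(zip(header, row)) for row in rows]


# Illustrative request: total population (B01003_001E) for every state,
# taken from the 2019 ACS 5-year estimates.
url = build_query_url(2019, "acs/acs5", ["NAME", "B01003_001E"], "state:*")
print(url)

# Made-up payload in the assumed response shape; the values are placeholders,
# not real Census figures.
sample = '[["NAME", "B01003_001E", "state"], ["Example State", "12345", "99"]]'
print(rows_to_records(sample))
```

Note that every value comes back as a string in this response shape, so numeric columns need converting before analysis. An API key is optional in this sketch and is only added to the URL when supplied.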

 

New & Continuously Revised Static Data Sets / Databases:

 

Static Data Sets / Databases:

 

Websites for Visualizing Data:

  • Information is Beautiful (http://www.informationisbeautiful.net/)
  • Many Eyes (http://manyeyes.alphaworks.ibm.com/manyeyes/) – This is billed as a wiki for data and visualizations of data. Users can contribute datasets and graphics as well as comment on what others have contributed. Some of the “visualizations” are pretty bizarre; others are interesting. For example, I’m not sure where else I could find different datasets (e.g. current average home rental prices) for counties in Ireland, display the data by shading a map of Ireland with the variable I choose, and have a link to the report where the data appear. A search with keyword “golf” produced 14 hits, including several that referred to the Volkswagen Golf, a couple where individual golfers posted datasets with their own scores (with quite detailed info for each round), listings of the length and price to play for golf courses in the Toronto area, the World Golf Rankings Top 250 golfers (from 2007), and data on PGA Tour golfers (from ESPN) for the 2007 season.
  • From Mark Ward at Purdue: http://llc.stat.purdue.edu/2014/29000/visualsites.html
  • Nathan Yau’s amazing visualizations: http://flowingdata.com/ (mostly including corresponding datasets)