Data Sources

Websites for Visualizing Data:

Collections of Data Sets:

  • realclimate.org keeps an up-to-date catalogue of many different types of climate data (http://www.realclimate.org/index.php/data-sources/)
  • College Scorecard (https://collegescorecard.ed.gov/).  A tremendous amount of information about all universities (though some of it is collected only from students on financial aid).
  • Financial and Economic data (https://www.quandl.com/)
  • Behavioral Risk Factor Surveillance System: http://www.cdc.gov/brfss/
  • General Social Survey (http://www3.norc.org/GSS+Website/)
  • National Health and Nutrition Examination Survey from the CDC: http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm
  • FEC contributions data (as part of Hadley Wickham’s dplyr package)
  • Medicare dataset (discussed on whitehouse.gov)
  • Yahoo big data datasets
  • SF OkCupid Users: Everett Wetchler wrote a Python script back in the day to rip the public profiles of San Francisco OkCupid users.  He pulled one snapshot (June 26, 2012) of all OkCupid users who lived within 25 miles of San Francisco, subject to some other caveats. It might be of interest to students given the recent press that data-driven approaches to online dating have been getting, specifically the Wired article “How a Math Genius Hacked OkCupid to Find True Love” and Amy Webb’s TED Talk “How I hacked online dating”.
  • This growing dataset repository presents raw data from real medical studies and offers (a) a vignette summarizing the study, research question and study design; (b) a data dictionary with clear documentation of variables and codes; (c) a complete citation for the associated study publication; and (d) a variety of data formats compatible with the majority of statistical packages. http://www.lerner.ccf.org/qhs/datasets/
  • CAUSEweb data http://www.causeweb.org/cwis/SPT--BrowseResources.php?ParentId=5
  • Data formatted to use in R http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/announcements/2013/03/icpsr-releases-new-datasets-in-r-format
  • A data repository from statsci.org – a statistics and bioinformatics group in Australia http://www.statsci.org/datasets.html
  • Finding Data on the Internet http://www.inside-r.org/howto/finding-data-internet
  • State Health Facts: http://www.statehealthfacts.org/
  • gapminder.org – a fascinating website with amazing graphics (social and economic data broken down by country). Click on the spreadsheet links to download the data.
  • Wolfram/Alpha (http://www.wolframalpha.com/) – This is billed as a computational search engine. Put in “nachos” and you get a detailed nutritional analysis; put in “GDP of Albania” and you get several forms of GDP, a historical graph and other economic variables; put in your favorite college and you get lots of info (including number of degrees in mathematics in 2009, location on a map and a link to a satellite view of campus). While the case-by-case data display is not so convenient for building datasets, there are pretty good links to the sources that Wolfram is pulling data from. For example, the Wolfram/Alpha page of info on a college or university has a data source link at the bottom to the National Center for Education Statistics website, where you can download your own custom data files from IPEDS (the Integrated Postsecondary Education Data System) – want to know the average faculty salary by rank for all the schools in your comparison group? Likewise, the nacho search gives a link to the USDA’s National Nutrient Database, and a few clicks later I’ve got a spreadsheet with data on 50+ nutrients in 7400+ foods (and that’s the abbreviated data!).
  • Many Eyes (http://manyeyes.alphaworks.ibm.com/manyeyes/) – This is billed as a wiki for data and visualizations of data. Users can contribute datasets and graphics as well as comment on what others have contributed. Some of the “visualizations” are pretty bizarre; others are interesting. For example, I’m not sure where else I could find different datasets (e.g. current average home rental prices) from counties in Ireland, display the data by shading a map of Ireland with the variable I choose, and have a link to the report where the data appear. A search with keyword “golf” produced 14 hits, including several that referred to the Volkswagen Golf, a couple where individual golfers posted datasets with their own scores (and quite detailed info for each round), listings of the length and price to play of golf courses in the Toronto area, the World Golf Rankings Top 250 golfers (from 2007) and data on PGA Tour golfers (from ESPN) for the 2007 season.
  • http://timetric.com/ — time-series data sets, uploaded by users.
  • http://archive.ics.uci.edu/ml/datasets.html — UC Irvine’s Machine Learning Repository
  • Journal of Statistics Education Data Archive – datasets contributed by statistics teachers. Raw data are given in a .dat file with explanations of the variables in an accompanying .doc file. Several of these datasets are tied to longer JSE articles discussing their use in statistics classes. For example, try televisions.dat, televisions.txt, and the Rossman article for some data on life expectancy and numbers of televisions in various countries.
  • Baby names (popularity by year and state), compiled by the Social Security Administration
  • DASL is the Data and Story Library – a collection of datasets and related documentation which may be searched by data subjects or by statistical techniques
  • DASL in Australia
  • Statlib Dataset Archive – one of the original sources for archived data
  • National Institute of Standards and Technology (NIST) education data sets
  • CHANCE Project Datasets – data from recent media coverage of current events. Only a few datasets here, but many excellent references to teaching applications of statistics in the news can be found at the main CHANCE page
  • Electronic Dataset Service – a collection of links to datasets organized by statistical methods
  • Data – a collection of datasets from the book DATA by Andrews and Herzberg, stored at Statlib
  • FEDSTATS – links to Web access to data produced by US Government agencies
  • Sports Data Page

A Few Fun Datasets: