Fall 2019 Math 154 Schedule

Computational Statistics

Jo Hardin
2351 Millikan
jo.hardin@pomona.edu

Office Hours: Monday 1:30-3:30pm, Wednesday 9-11:30am, or by appointment

Mentor Sessions: Jack Hanley
Wednesday 8-10pm
Millikan 1021 (Emmy Noether Room)

Texts:
Required: An Introduction to Statistical Learning (ISL); James, Witten, Hastie, Tibshirani (freely available: http://www-bcf.usc.edu/~gareth/ISL/)

Recommended: Modern Data Science (MDS) with R; Baumer, Kaplan, and Horton (free chapters and other information at: https://github.com/beanumber/mdsr and http://mdsr-book.github.io/)

Recommended: Visual and Statistical Thinking (VST): Displays of Evidence for Making Decisions; Tufte (http://www.edwardtufte.com/tufte/books_textb)

Website for: Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving; Nolan and Temple Lang (http://rdatasciencecases.org/)

Homework:

  • Homework will be assigned from the text, with some additional problems. One homework grade will be dropped. Homework must be done in the statistical software package R, written in R Markdown (or R Sweave if you want to use LaTeX), and posted to GitHub by midnight on Thursdays. Non-homework activities (e.g., from the text) may be collected and added to your participation grade.
    • HW should be turned in to your GitHub repository by Thursday.
    • Always post both a PDF and R Markdown (or Sweave) file, unless otherwise requested.
    • HW is graded on a scale of 5/4/3/2/1. See the first HW assignment for more information.
    • Name HW files in the format: ma154-hw#-lname-fname.pdf

Participation:

  • This class will be interactive, and your participation is expected (every day in class). Although notes will be posted, your participation is an integral part of the in-class learning process.
  • In class:  after answering one question, wait until 5 other people have spoken before answering another question.  [Feel free to ask as many questions as often as you like!]
  • For each midterm, one point will be given for having done the following (before 10/17 and again before 11/26):
    • log on to Piazza (link will be sent via email)
    • using reprex, ask a question about R (can be anonymous to peers, name must be visible to instructor for credit) https://teachdatascience.com/reprex/
    • respond / help a peer who has asked a question about anything (can be anonymous to peers, name must be visible to instructor for credit)

Important Dates:

  • 10/17/19 Exam 1
  • 10/24/19 Take home 1 due (on GitHub by midnight)
  • 11/1/19 Initial Project Proposal due (via email to me by midnight)
  • 11/8/19 Final Project Proposal due (on GitHub by midnight)
  • November 14, 2019 4:15-5:30 Data Science Panel
  • 11/21/19 Project Update due (on GitHub by midnight)
  • 11/26/19 Take home 2 due (on GitHub by midnight)
  • 12/5/19 Exam 2
  • 12/13/19 (Friday) or 12/18/19 (Wednesday) Group Presentations (2-5pm)
  • 12/18/19 Final write-up due (on GitHub by midnight)

Handouts:

 

Date Topic / Chapter Links
Tues 9/3 data science & statistics
(ISL 1)
  • Great algorithm for the whole process

http://algorithms-tour.stitchfix.com/

  • Design Challenge (Not So Standard Deviations):

https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/

  • Video (less than 2 min) on the strengths of reproducible research

https://www.youtube.com/watch?v=s3JldKoA0zw&feature=youtu.be

  • R vs. Python?  (My personal opinion is that neither of the languages is “best”.)

http://www.datasciencecentral.com/profiles/blogs/data-science-wars-r-versus-python

  • Kaggle survey on their users

https://www.kaggle.com/surveys/2017

  • PNAS paper retracted due to problems with figure and reproducibility (April 2016):

http://cardiobrief.org/2016/04/06/pnas-paper-by-prominent-cardiologist-and-dean-retracted/

  • Analysis of Trump’s tweets with evidence that someone else tweets from his account using an iPhone

http://varianceexplained.org/r/trump-tweets/

http://varianceexplained.org/r/trump-followup/

Tues 9/10 visualization
(VST & optional: MDS 2)
  • Flowcharts for choosing appropriate plots, brief tutorials of viz types, and source code in R and Python.

https://www.data-to-viz.com/

  • See Something or Say Something

https://www.flickr.com/photos/walkingsf/sets/72157627140310742/

  • Global terrorism trends (created by students at Grinnell)

http://rstudio.grinnell.edu/Global_Terrorism_Plots/

http://rstudio.grinnell.edu/Global_Terrorism_Map_Basic/

  • Census trends visualized:

http://www.census.gov/dataviz/visualizations/055/

  • Visualization Internship (summer 2016) at 538:

http://fivethirtyeight.com/features/fivethirtyeight-is-hiring-a-data-visualization-intern-for-summer-2016/

  • Best Data Visualizations

http://www.visualisingdata.com/2017/07/10-significant-visualisation-developments-january-june-2017/

  • A new NYT column on visualizations

https://www.nytimes.com/column/whats-going-on-in-this-graph?

  • Studies about visualizations and perception

https://medium.com/@kennelliott/39-studies-about-human-perception-in-30-minutes-4728f9e31a73

  • Fundamentals of Data Visualization

http://serialmentor.com/dataviz/

Tues 9/17 data wrangling
(MDS 4, free here)
 
Tues 9/24 simulating
(optional: MDS 8)
  • Simulating who will be in the first GOP debate (NYT 7/29/15)

http://www.nytimes.com/interactive/2015/07/21/upshot/election-2015-the-first-gop-debate-and-the-role-of-chance.html

Tues 10/1 permutation tests
  • Statistics without the agonizing pain – John Rauser

https://www.youtube.com/watch?v=5Dnw46eC-0o

  • The algorithm that could end partisan gerrymandering

https://www.youtube.com/watch?v=gRCZR_BbjTo&t=125s

Tues 10/8 bootstrapping
(ISL 5)
  • Daniela Witten (11:40 – 19:40) at Simply Statistics Unconference on the Future of Statistics

https://www.youtube.com/watch?v=Y4UJjzuYjfM

  • Five ways to fix statistics — Nature Nov 28, 2017

https://www.nature.com/articles/d41586-017-07522-z

Tues 10/15 catch-up & review
Thurs 10/17 exam 1
  • https://teachdatascience.com/diversity/
  • organizations / twitter handles celebrating diversity in statistics, ML, AI, data, etc.

    #Data4BlackLives

    #WiML2018

    @black_in_ai

    @_LXAI

    @InclusionInML

    @RLadiesGlobal

    @RLadiesLA

Tues 10/22 fall break / ethics
(MDS 6, free here)
  • When algorithms discriminate:

https://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html?mcubz=0&_r=0

  • Is special education racist?

https://www.nytimes.com/2015/06/24/opinion/is-special-education-racist.html?mcubz=0

Thurs 10/24 take home 1 due
  • What is it that we can learn (or not) from statistical models & machine learning? Series of podcasts by Hilary Parker (PO ’08) and Roger Peng.

https://soundcloud.com/nssd-podcast/episode-4-a-gajillion-time-series/

Tues 10/29 initial project proposal due Tues

k-nn, trees
(ISL 4, 5, 8)

  • Why the Bronx really burned: “adjusting” data to give the wrong information

http://fivethirtyeight.com/datalab/why-the-bronx-really-burned/

  • SF vs. NYC housing (trees)

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Tues 11/5 final project proposal due Tues

bagging, random forests
(ISL 8)

  • The end of science:

http://www.wired.com/2008/06/pb-theory/

  • Maybe not so fast:

http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/

Tues 11/12 support vector machines
(ISL 9)
  • ROC curve of science

http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/

Tues 11/19 clustering
(ISL 10)
  • Fantastic k-means applet:

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

  • Analyzing networks of characters in ‘Love Actually’

http://varianceexplained.org/r/love-actually-network/

  • Network Analysis of political books — Bridging the divide: political books

https://espresso.economist.com/6412121cbb2dc2cb9e460cfee7046be2

Tues 11/26 Project tidbits (API, SQL, authenticating, parallel, cloud)

 

take home 2 due

  • The Statistics Identity Crisis

https://www.youtube.com/watch?v=JLs01Z5baSU

  • Write your own R package

https://stat545-ubc.github.io/packages00_index.html

https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio

 

Tues 12/3 catch-up & review

Thurs 12/5 exam 2

  • Data Science jobs in high demand:

https://www.bloomberg.com/news/articles/2017-08-21/here-s-a-retail-job-that-s-still-in-high-demand-data-scientist

  • SAMSI: Workshop on Distributed Data Analysis with Applications in Finance and Healthcare (March 2016)

http://www.samsi.info/workshop/workshop-distributed-data-analysis-applications-finance-and-healthcare-march-21-22-2016

  • 2016 Statistical Sciences Symposium on Statistical Machine Learning: Theory and Methods, UC Davis, April 23, 2016

http://www.stat.ucdavis.edu/seminars/conferences/index.html

  • UCLA Datafest (April 5-7, 2019)

http://fivethirtyeight.com/datalab/the-students-most-likely-to-take-our-jobs/

http://datafest.stat.ucla.edu/

https://dataskeptic.com/blog/episodes/2015/data-fest-2015

 

Tues 12/10 text analysis (NLP)
Fri 12/13 & Wed 12/18, 2-5pm: Group Presentations (schedule TBA)

Final write-up due Wed 12/18 midnight

Some project examples / ideas: