Course materials

Introductory module

Objectives: set expectations; explore the raison d’être of data science; introduce systems and design thinking; introduce software tools and collaborative coding; conduct exploratory/descriptive analysis of class background and interests.
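
For a flavor of the kind of descriptive analysis involved, here is a minimal base-R sketch on a hypothetical intake survey. The data frame and its columns (`major`, `coded_before`) are invented for illustration; the actual class survey data are introduced in Week 2.

```r
# Hypothetical intake-survey responses (invented for illustration)
survey <- data.frame(
  major = c("Statistics", "CS", "Statistics", "Math", "CS", "CS"),
  coded_before = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Descriptive summaries: class composition and prior coding experience
table(survey$major)                                          # counts by major
prop.table(table(survey$coded_before))                       # overall share with coding experience
aggregate(coded_before ~ major, data = survey, FUN = mean)   # experience rate by major
```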

Week 0

  • Thursday meeting: Course orientation [slides]

  • Assignments due by next class meeting:

Week 1

Week 2

  • Tuesday meeting: Introducing class intake survey data [slides]

  • Section meeting: tidyverse basics [activity]

  • Thursday meeting: planning group work for analysis of survey data [slides]

  • Assignments:

Module 1: biomarker identification

Objectives: introduce variable selection, classification, and multiple testing problems; discuss classification accuracy metrics and data partitioning; fit logistic regression and random forest classifiers in R; learn to implement multiple testing corrections for FDR control (Benjamini-Hochberg and Benjamini-Yekutieli); discuss selection via penalized estimation. Data from Hewitson et al. (2021).
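
Both FDR corrections named above are built into base R via `stats::p.adjust`. A minimal sketch with invented p-values (not from the Hewitson et al. data):

```r
# Illustrative p-values, sorted for readability (invented; not from the study data)
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)

# Benjamini-Hochberg and Benjamini-Yekutieli adjusted p-values
p_bh <- p.adjust(p, method = "BH")
p_by <- p.adjust(p, method = "BY")

# "Discoveries" at a target FDR of 0.05: compare adjusted p-values to 0.05
which(p_bh <= 0.05)

# BY guards against arbitrary dependence and is more conservative:
# its adjusted p-values are never smaller than BH's
all(p_by >= p_bh)  # TRUE
```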

Week 3

Week 4

  • Tuesday meeting: random forests cont’d; logistic regression [slides]

  • Section meeting: logistic regression and classification metrics [activity]

  • Thursday meeting: LASSO regularization [slides]

  • Assignments:

Module 2: fraud claims

Objectives: introduce NLP techniques for converting text to data, along with web scraping tools in R; discuss dimension reduction techniques; introduce multiclass classification; learn to process text, fit multinomial logistic regression models, and train neural networks in R.
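
As a small illustration of the text-to-data step, the base-R sketch below tokenizes a few invented documents and tabulates them into a document-term count matrix; the example strings are made up, not drawn from the fraud-claims data, and a real pipeline would add steps such as stop-word removal.

```r
# Invented example documents (not from the course's fraud-claims data)
docs <- c("suspicious claim filed twice",
          "claim approved after review",
          "review flagged suspicious activity")

# Tokenize: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")

# Build a document-term count matrix over the shared vocabulary
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dim(dtm)  # one row per document, one column per vocabulary term
```

Each row of `dtm` is now a numeric representation of a document, ready for dimension reduction or use as classifier input.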

Week 5

  • Tuesday meeting: data introduction and basic NLP techniques [slides]

  • Section meeting: string manipulation and text processing in R [activity]

  • Thursday meeting: dimension reduction; multinomial logistic regression [slides] [activity]

  • Optional further reading:

Week 6

  • Tuesday meeting: feedforward neural networks [slides]

  • Section meeting: fitting neural nets with keras [activity]

  • Thursday meeting: NO CLASS

  • Assignments:

  • Optional further reading:

    • Alzubaidi et al. (2021)

    • Goodfellow, Bengio, and Courville (2016) Ch. 6 (advanced)

References

Alzubaidi, Laith, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, José Santamaría, Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. 2021. “Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions.” Journal of Big Data 8 (1): 1–74.
Cambria, Erik, and Bebo White. 2014. “Jumping NLP Curves: A Review of Natural Language Processing Research.” IEEE Computational Intelligence Magazine 9 (2): 48–57.
Emmert-Streib, Frank, Zhen Yang, Han Feng, Shailesh Tripathi, and Matthias Dehmer. 2020. “An Introductory Review of Deep Learning for Prediction Models with Big Data.” Frontiers in Artificial Intelligence 3: 4.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Hewitson, Laura, Jeremy A Mathews, Morgan Devlin, Claire Schutte, Jeon Lee, and Dwight C German. 2021. “Blood Biomarker Discovery for Autism Spectrum Disorder: A Proteomic Analysis.” PLoS One 16 (2): e0246581.
Khan, Aurangzeb, Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. “A Review of Machine Learning Algorithms for Text-Documents Classification.” Journal of Advances in Information Technology 1 (1): 4–20.
Peng, Roger D, and Hilary S Parker. 2022. “Perspective on Data Science.” Annual Review of Statistics and Its Application 9: 1–20.