PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

Join Slack workspace, monitor channel #22f-pstat197a for announcements.

Office hours by demand immediately following section and class meetings

- Yan will hold a drop in hour in building 434 room 126 after Josh’s section

Install course software and bring your laptop to section meetings. Remember your table number from today.

Data science emerged as a term of art in the last decade

Interest exploded in the last five years

Tukey advocated for ‘data analysis’ as a broader field than statistics (Tukey 1962), including:

statistical theory and methodology;

visualization and data display techniques;

computation and scalability;

breadth of application.

Look famililar? Tukey’s ‘data analysis’ is proto-modern data science.

In the 1960’s and 1970’s, these concepts meant very different things.

visualization meant

*drawing*computation meant

*data re-expression by hand*

But the ideas were still somewhat radical. At the time most relied on highly reductive numerical results to interpret data:

ANOVA tables

regression tables

p-values

The new techniques allowed for ** iterative** investigation:

formulate a question

examine data graphics and summaries

adjust computations and graphics to hone in on content of interest

refine the question

Suppose we want to explain variation in birth-to-death ratios in the U.S. ^{1}

Initial question: is population density an associated factor?

It’s worth noting that in the first half of the 20th century, much of statistics focused on methodology and theory for the analysis of small *iid* samples, and in particular:

inference on means and inference on tables;

analysis of variance;

tests of distribution.

The inferential framework brought to bear on these ‘simpler’ problems largely carried over when the field began to specialize.

From 1960-2010, adopters of the ‘data analysis as a field’ view were largely industry practitioners and applied statisticians who advocated for training and practice that included empirical methods and computation in addition to statistical inference (Donoho 2017).

Their ideas evolved into an alternative approach to working with data:

data-driven rather than theory-driven;

iterative rather than conclusive.

The “confirmatory” approach of the classical inferential framework.

output is a decision

statistical model determined by experimental design

analysis based on statistical theory

The “exploratory” approach of iterative modern data analysis.

outputs are findings

statistical model determined by data

analysis techniques include empirical methods

In the 2000s and especially after 2010, the iterative approach enjoys broader applicability than it used to:

due to automated and/or scalable data collection

observational data is widely available across domains

and includes large numbers of variables

highly specialized data problems evade methodology with theoretical support

more accessible to analysts without advanced statistical training

Machine learning was largely advanced by computer scientists through 2010 and later (Emmert-Streib et al. 2020), most notably:

neural networks and deep learning

optimization

algorithmic analysis

This was a major driver in advancing modern predictive modeling, and engaging with these tools required going beyond statistics.

Around mid-century, it was proposed that specialists should be trained in computational as well as statistical methods

Over time practitioners developed iterative processes for data-driven problem solving that was more flexible than the classical inferential framework

Computer scientists advanced the field of machine learning substantially

Iterative problem solving together with applied machine learning was well-suited to meet the demands of modern data, but the area was not codified in an academic discipline

Research is *systematic investigation* undertaken in order to establish or discover facts.

What are facts in data science?

method

*M*outperforms method*M’*at task*T*we analyzed data

*D*and reached the conclusion that…

Formal communities – *i.e.,* journals, departments, conferences – have not coalesced around data science research to date.

Relevant research largely occurs in statistics, computer science, and application domains, and can be divided broadly into:

methodology – creating new techniques to analyze data

applications – applying existing methods to generate new findings

Methodological research might involve:

designing a faster algorithm for solving a particular problem

proposing a new technique for analyzing a particular type of data

generalizing a technique to a broader range of problems

Applied research might involve:

analyzing a specific dataset or producing a novel analysis of existing data

creating ad-hoc methods for a domain-specific problem

importing methodology from another area to bear on a domain-specific problem

Most of the time, our data science capstones fall pretty squarely in the applied domain:

sponsor provides data and high-level goals

student team works on producing an analysis or analyses

mentor advises on methodology

There are a few avenues to prepare for this sort of work.

We’ll focus on:

recognizing problem patterns

developing a functional view of methodology

collaborating efficiently

independent learning strategies

engaging with literature constructively

It won’t provide you with exhaustive methodological preparation, but should support you in learning ‘on the job’.

Questions on the perspectives paper (Peng and Parker 2022) to review:

- What is meant by a ‘systems approach’ to data science?
- What is meant by ‘design thinking’ in data science?
- (Why) Are these useful concepts?

Several systems affect the relationship between expected and actual results. Where would you locate them on the figure?

Data analytic

Software

Scientific

How might this diagram help an analyst?

The design thinking framework might be summed up:

data scientists trade in data analyses

a data analysis is a designed product

thinking about design principles can help make a better product

Many of you focused on how design principles are a response to project constraints. Are there other ways a design perspective might be useful?

You’re working at a news organization and developing a recommender system for targeted article previews to deploy on the organization’s website. It will show users article previews based on their behavior. Assume you don’t have any significant resource constraints, and can access users’ profiles in full and log interactions in near-real-time.

** Goal:** show previews most likely to attract interest.

Considerations:

what material should be shown in the preview? headlines? images? text?

what behavior can/should be leveraged for the recommender system?

what are a few relevant design aspects of how the system should behave?

are there ethical concerns?

You’re working on a research team studying ecological impacts of land use. The team has access to longitudinal species surveys at locations of interest across the U.S., quarterly county-level land allocation statistics, satellite images, and state budget information for sustainability, restoration, and conservation initiatives.

** Goal:** identify intervention opportunities that are most likely to positively impact ecological diversity.

Considerations:

what data would you use and how would you combine data sources?

are there external data that might be useful?

what analysis outputs would be most important for identifying intervention opportunities?

can you think of other design features that might be useful for the data analysis?

Let’s look at some design principles from (McGowan, Peng, and Hicks 2021).

We’ll do a github icebreaker activity.

Complete lab activity from Wednesday section meeting

Bring laptops

Donoho, David. 2017. “50 Years of Data Science.” *Journal of Computational and Graphical Statistics* 26 (4): 745–66.

Emmert-Streib, Frank, Zhen Yang, Han Feng, Shailesh Tripathi, and Matthias Dehmer. 2020. “An Introductory Review of Deep Learning for Prediction Models with Big Data.” *Frontiers in Artificial Intelligence* 3: 4.

McGowan, Lucy D’Agostino, Roger D Peng, and Stephanie C Hicks. 2021. “Design Principles for Data Analysis.” *arXiv Preprint arXiv:2103.05689*.

Peng, Roger D, and Hilary S Parker. 2022. “Perspective on Data Science.” *Annual Review of Statistics and Its Application* 9: 1–20.

Tukey, John W. 1962. “The Future of Data Analysis.” *The Annals of Mathematical Statistics* 33 (1): 1–67.

Tukey, John W et al. 1977. *Exploratory Data Analysis*. Vol. 2. Reading, MA.