PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

add `.DS_Store` to `.gitignore`

open repo project in RStudio session (not another project or new session)

repo clone directory must be kept intact; can move the entire directory but not individual files

use a Git client rather than the terminal, at least to start out

others?

review sampling concepts

introduce class survey data

present descriptive analysis

**population:** collection of all subjects/units of interest

**sample:** subjects/units observed in a study

Statistical methodology strives to account for the possibility that the sample could have been different. Knowledge of the sampling mechanism is what makes reliable inferences about the population possible.

Even if inference isn’t possible, data still have value and could be used for:

descriptive analysis of the sample;

hypothesis generation;

developing analysis pipelines.

Prediction is a separate goal but still a form of generalization.

- samples must reflect a broader population for predictions to be accurate at the population level

- if an analyst can’t expect sample statistics to provide reliable estimates of population quantities, they shouldn’t expect predictions based on the sample to be reliable either

Several issues arise *very* often in practice that compromise or complicate an analyst’s ability to make inferences (or predictions). Among them:

scope of inference from the sample doesn’t match the study population

subjects/units are selected haphazardly or by convenience

researcher conflates sample size with number of observations, *i.e.,* takes lots of measurements on few subjects/units

The following questions can help assess the scope of inference:

(protocol) how were subjects/units chosen for measurement and how were measurements collected?

(mechanism) was there any random selection mechanism?

(exclusion) are there any subjects/units that couldn’t possibly have been chosen?

(nonresponse) were any subjects/units selected but not measured?

survey distributed to all students offered enrollment in PSTAT197A fall 2022

\(n = 65\) responses

includes a few students who did not enroll

does not include several students who did not enroll

does not include one student who enrolled late

no random selection

From the reading responses:

It depends on the question. If you want to draw conclusions about the pstat197a class specifically, this sample is the population and thus will have reliable data. If you want to draw conclusions about the pstat department as a whole, then this is a bad sample because it is likely biased and thus unreliable

The comment points to two ways to view the data:

a census of PSTAT197A enrollees

a convenience sample of…

capstone applicants OR

students qualified for capstones OR

students interested in data science OR

all UCSB students???

Either view, census or convenience sample, rules out inference.

census \(\longrightarrow\) no inference needed

convenience \(\longrightarrow\) no inference possible

So on a practical level, the choice of view won’t make much difference for designing an analysis of the survey data.

Any analysis of survey data should be regarded as *descriptive* in nature:

summary statistics and/or models are not reliable measures of any broader population

results should be interpreted narrowly in terms of the sample at hand

**Start simple and add complexity gradually.**

From simpler to more complex consider questions involving:

- Sample characteristics
- Single-variable summaries
- Multivariate summaries
- Model-based outputs (estimates, predictions, etc.)

**Sample characteristics**

- Is the proportion of men/women in the class equal (taking into account randomness)?

**Single-variable summaries**

Among the students offered a seat in PSTAT197, what fields of study are the students most interested in?

What level of comfort do students interested in data analysis at UCSB have with mathematics?

**Multivariate summaries**

- Are students who ranked themselves as strong in statistics, mathematics, and computing more likely or less likely to select an ‘industry’ project as the project type that they want to work on?

**Model-based outputs**

- Are there distinct groups of students in the class defined by self-assessed proficiencies and/or comfort levels with mathematics, statistics, and programming?

Is the proportion of men/women in the class equal (taking into account randomness)?

| standing | n |
|---|---|
| Junior | 9 |
| Senior | 56 |

| gender | n |
|---|---|
| Female | 25 |
| Male | 40 |
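As a purely descriptive answer to the question above, the sample shares can be computed directly from the gender counts. This is a minimal sketch; the variable names are illustrative, and the result describes only the 65 respondents.

```python
# Counts copied from the gender table above.
gender_counts = {"Female": 25, "Male": 40}

total = sum(gender_counts.values())
# Share of respondents in each category (descriptive only, no inference).
shares = {gender: n / total for gender, n in gender_counts.items()}

for gender, share in shares.items():
    print(f"{gender}: {share:.1%}")
# prints "Female: 38.5%" then "Male: 61.5%"
```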

| race | n |
|---|---|
| Asian | 41 |
| Caucasian | 17 |
| Prefer not to say | 6 |
| Unknown | 1 |

**Columns**: consent to share project preferences

**Rows**: consent to share background and preparation

| | No | Yes |
|---|---|---|
| No | 3 | 3 |
| Yes | 2 | 57 |

The following information has been removed from the dataset distributed to the class:

personal information from section 1 of the survey

long text and free-response answers, which contain some personal details

responses from students who did not consent to share

the distinction between types of research experience

What level of comfort do students interested in data analysis at UCSB have with mathematics?

| variable | max | mean | median | min |
|---|---|---|---|---|
| math.comf | 5 | 3.847458 | 4 | 2 |
| prog.comf | 5 | 3.966102 | 4 | 3 |
| stat.comf | 5 | 4.084746 | 4 | 2 |

| variable | mean | median |
|---|---|---|
| math | 2.355932 | 2 |
| prog | 2.237288 | 2 |
| stat | 2.576271 | 3 |

| level | prog | math | stat |
|---|---|---|---|
| Beg | 3 | 3 | 2 |
| Int | 39 | 32 | 21 |
| Adv | 17 | 24 | 36 |
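The proficiency means and medians reported above can be recovered from the level counts. The sketch below assumes the levels are coded Beg = 1, Int = 2, Adv = 3; that coding is an assumption, though it is consistent with the reported summaries. The helper name `expand` is illustrative.

```python
from statistics import mean, median

# Counts copied from the proficiency table; the 1/2/3 coding is an assumption.
counts = {
    "prog": {1: 3, 2: 39, 3: 17},
    "math": {1: 3, 2: 32, 3: 24},
    "stat": {1: 2, 2: 21, 3: 36},
}

def expand(freq):
    """Turn a {score: count} table back into a flat list of scores."""
    return [score for score, n in freq.items() for _ in range(n)]

# (mean rounded to six decimals, median) for each variable.
summary = {
    name: (round(mean(expand(freq)), 6), median(expand(freq)))
    for name, freq in counts.items()
}

for name, (avg, med) in summary.items():
    print(f"{name}: mean={avg}, median={med}")
```

Each variable has 59 ratings, which matches the number of respondents who consented to share background and preparation.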

Are students who ranked themselves as strong in statistics, mathematics, and computing more likely or less likely to select an ‘industry’ project as the project type that they want to work on?

| mean.proficiency.fac | both | ind | lab |
|---|---|---|---|
| [1,2.33] | 6 | 25 | 1 |
| (2.33,2.67] | 5 | 9 | 1 |
| (2.67,3] | 3 | 6 | 1 |

| mean.proficiency.fac | both | ind | lab | n |
|---|---|---|---|---|
| [1,2.33] | 0.188 | 0.781 | 0.031 | 32 |
| (2.33,2.67] | 0.333 | 0.600 | 0.067 | 15 |
| (2.67,3] | 0.300 | 0.600 | 0.100 | 10 |
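The second table is the first one with each row divided by its row total. A minimal sketch of that step, with the counts copied from the table of counts and an illustrative function name `row_proportions`:

```python
# Counts of preferred project type by binned mean proficiency (from the table of counts).
counts = {
    "[1,2.33]": {"both": 6, "ind": 25, "lab": 1},
    "(2.33,2.67]": {"both": 5, "ind": 9, "lab": 1},
    "(2.67,3]": {"both": 3, "ind": 6, "lab": 1},
}

def row_proportions(table):
    """Divide each row by its total and record the total as n."""
    out = {}
    for group, row in table.items():
        n = sum(row.values())
        out[group] = {key: round(value / n, 3) for key, value in row.items()}
        out[group]["n"] = n
    return out

props = row_proportions(counts)
for group, row in props.items():
    print(group, row)
```

Row proportions make the groups comparable despite their different sizes (32, 15, and 10), but the two smaller groups rest on very few students, so the differences should be read descriptively.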

Consider the distinct *combinations* of proficiency ratings (first table) and comfort ratings (second table), tabulated separately:

| prog | math | stat | n |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 1 |
| 1 | 2 | 3 | 1 |
| 2 | 1 | 1 | 1 |
| 2 | 1 | 2 | 1 |
| 2 | 2 | 2 | 13 |
| 2 | 2 | 3 | 11 |
| 2 | 3 | 2 | 3 |
| 2 | 3 | 3 | 10 |
| 3 | 2 | 2 | 2 |
| 3 | 2 | 3 | 4 |
| 3 | 3 | 2 | 1 |
| 3 | 3 | 3 | 10 |

| prog | math | stat | n |
|---|---|---|---|
| 3 | 2 | 2 | 1 |
| 3 | 3 | 3 | 2 |
| 3 | 3 | 4 | 2 |
| 3 | 3 | 5 | 2 |
| 3 | 4 | 3 | 4 |
| 3 | 4 | 4 | 7 |
| 4 | 3 | 3 | 2 |
| 4 | 3 | 4 | 5 |
| 4 | 3 | 5 | 2 |
| 4 | 4 | 3 | 1 |
| 4 | 4 | 4 | 5 |
| 4 | 4 | 5 | 3 |
| 4 | 5 | 3 | 1 |
| 4 | 5 | 4 | 4 |
| 4 | 5 | 5 | 2 |
| 5 | 3 | 4 | 5 |
| 5 | 3 | 5 | 1 |
| 5 | 4 | 4 | 2 |
| 5 | 4 | 5 | 1 |
| 5 | 5 | 4 | 1 |
| 5 | 5 | 5 | 6 |
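Tables like the two above come from tallying how often each (prog, math, stat) triple occurs. A minimal sketch of that tally follows; the six rating triples are hypothetical and are not the survey responses.

```python
from collections import Counter

# Hypothetical (prog, math, stat) ratings for six respondents; NOT the survey data.
ratings = [(2, 2, 2), (2, 2, 3), (2, 2, 2), (3, 3, 3), (2, 3, 3), (3, 3, 3)]

# Count how many respondents share each distinct combination.
combo_counts = Counter(ratings)

# Print in sorted order, one combination per row, with its count last.
for combo, n in sorted(combo_counts.items()):
    print(*combo, n, sep=" | ")
```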

Can students be grouped based on combinations of proficiency and comfort levels?

| prog.prof | math.prof | stat.prof | prog.comf | math.comf | stat.comf | size | cluster |
|---|---|---|---|---|---|---|---|
| 2.478 | 2.739 | 2.957 | 4.435 | 4.522 | 4.522 | 23 | 1 |
| 2.048 | 1.857 | 2.238 | 4.048 | 3.048 | 4.048 | 21 | 2 |
| 2.133 | 2.467 | 2.467 | 3.133 | 3.933 | 3.467 | 15 | 3 |

The clustering method, “k-means”, assigns each observation to the nearest of \(k\) centers in Euclidean distance. The number of centers \(k\) is user-specified; the method seeks the centers that minimize within-cluster variance.
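To make the description concrete, here is a minimal sketch of one common k-means procedure (Lloyd's algorithm): alternate between assigning points to the nearest center and moving each center to the mean of its points. It is not necessarily the routine used to produce the table above, the six two-dimensional points are hypothetical, and the function names are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:  # assignments have stopped changing
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

This procedure only guarantees a local minimum of within-cluster variance, so results can depend on the starting centers; rerunning from several random starts is a common precaution.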

Based on the centers:

Cluster 1: advanced proficiency, very comfortable

Cluster 2: intermediate with less mathematical preparation

Cluster 3: intermediate with less programming preparation

Your task is to extend this analysis with your group by next Tuesday.

Here are some ideas:

explore variable associations further (*e.g.,* coursework and self-evaluations)

experiment with clustering on different variable subsets or using different methods

summarize domain or area of interest variables (requires some text manipulation)

We’ll devote most of the next meeting to planning your group’s task.

Do a little brainstorming on your own

Come with a few questions/ideas