Sampling concepts and descriptive analysis

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

Lessons from last time

  • add .DS_Store to .gitignore

  • open repo project in RStudio session (not another project or new session)

  • repo clone directory must be kept intact; can move the entire directory but not individual files

  • use a Git client rather than the terminal, at least to start out

  • others?

Today

  • review sampling concepts

  • introduce class survey data

  • present descriptive analysis

Sampling concepts

Samples and populations

population: collection of all subjects/units of interest

sample: subjects/units observed in a study

statistical methodology strives to account for the possibility that the sample could have been different; knowledge of the sampling mechanism is what allows reliable inferences about the population

What if inferences aren’t possible?

Even if inference isn’t possible, data still have value and could be used for:

  • descriptive analysis of the sample;

  • hypothesis generation;

  • developing analysis pipelines.

What about prediction?

Prediction is a separate goal but still a form of generalization.

  • samples must reflect a broader population for predictions to be accurate at the population level
  • if an analyst can’t expect sample statistics to provide reliable estimates of population quantities, they shouldn’t expect predictions based on the sample to be reliable either

Common problems

Several issues arise very often in practice that compromise or complicate an analyst’s ability to make inferences (or predictions). Among them:

  • scope of inference from the sample doesn’t match the study population

  • subjects/units are selected haphazardly or by convenience

  • researcher conflates sample size with number of observations, i.e., takes lots of measurements on few subjects/units
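The second problem in the list can be illustrated with a small simulation. This is a minimal sketch and not part of the course materials: the population, the trait values, and the rule that decides which units are "easy to reach" are all invented for illustration, and the selection rule is deliberately extreme.

```python
import random
import statistics

random.seed(197)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 10,000 units with a numeric trait.
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Simple random sample: every unit has the same chance of selection.
srs = random.sample(population, 100)

# Convenience sample: only units that are "easy to reach". By assumption,
# those are the 100 units with the largest trait values.
convenience = sorted(population)[-100:]

srs_error = abs(statistics.mean(srs) - pop_mean)
conv_error = abs(statistics.mean(convenience) - pop_mean)

print(f"population mean:         {pop_mean:.2f}")
print(f"random sample mean:      {statistics.mean(srs):.2f}")
print(f"convenience sample mean: {statistics.mean(convenience):.2f}")
```

Both samples have the same size, so the gap between the two errors comes from how the units were selected, not from how many were measured.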

Helpful questions

The following questions can help make an assessment of the scope of inference:

  • (protocol) how were subjects/units chosen for measurement and how were measurements collected?

  • (mechanism) was there any random selection mechanism?

  • (exclusion) are there any subjects/units that couldn’t possibly have been chosen?

  • (nonresponse) were any subjects/units selected but not measured?

Class survey data

  • survey distributed to all students offered enrollment in PSTAT197A fall 2022

  • \(n = 65\) responses

    • includes a few students who did not enroll

    • does not include several students who did not enroll

    • does not include one student who enrolled late

  • no random selection

Can the data support inference?

From the reading responses:

It depends on the question. If you want to draw conclusions about the pstat197a class specifically, this sample is the population and thus will have reliable data. If you want to draw conclusions about the pstat department as a whole, then this is a bad sample because it is likely biased and thus unreliable

Alternative perspectives

The comment points to two ways to view the data:

  • a census of PSTAT197A enrollees

  • a convenience sample of…

    • capstone applicants OR

    • students qualified for capstones OR

    • students interested in data science OR

    • all UCSB students???

Is there a right answer?

Under either view (census or convenience sample), inference is off the table.

  • census \(\longrightarrow\) no inference needed

  • convenience \(\longrightarrow\) no inference possible

So on a practical level, it won’t make much difference for designing an analysis of the survey data.

Descriptive analysis

Any analysis of the class survey data should be regarded as descriptive in nature:

  • summary statistics and/or models are not reliable measures of any broader population

  • results should be interpreted narrowly in terms of the sample at hand

Descriptive analysis

A general approach

Start simple and add complexity gradually.

From simpler to more complex consider questions involving:

  1. Sample characteristics
  2. Single-variable summaries
  3. Multivariate summaries
  4. Model-based outputs (estimates, predictions, etc.)

Questions of interest

Sample characteristics

  • Is the proportion of men/women in the class equal (taking into account randomness)?

Single-variable summaries

  • Among the students offered a seat in PSTAT197, what fields of study are the students most interested in?

  • What level of comfort do students interested in data analysis at UCSB have with mathematics?

Multivariate summaries

  • Are students who ranked themselves as strong in statistics, mathematics, and computing more likely or less likely to select an ‘industry’ project as the project type that they want to work on?

Model-based outputs

  • Are there distinct groups of students in the class defined by self-assessed proficiencies and/or comfort levels with mathematics, statistics, and programming?

Sample characteristics

Is the proportion of men/women in the class equal (taking into account randomness)?

standing   n
Junior     9
Senior    56

gender     n
Female    25
Male      40

race                 n
Asian               41
Caucasian           17
Prefer not to say    6
Unknown              1
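For a descriptive answer to the question, the counts can be turned into sample proportions. The sketch below only restates the gender counts from the table; it does not test whether the proportions are equal, since the data do not support inference. The variable names are illustrative.

```python
# Counts copied from the gender table above.
gender_counts = {"Female": 25, "Male": 40}

n = sum(gender_counts.values())
proportions = {level: count / n for level, count in gender_counts.items()}

for level, p in proportions.items():
    print(f"{level}: {p:.3f}")
```

Among the 65 respondents, about 38% reported female and about 62% reported male.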

Columns: consent to share project preferences

Rows: consent to share background and preparation

                  preferences: No   preferences: Yes
background: No                  3                  3
background: Yes                 2                 57

Majors

Response timing

Privacy

The following information has been removed from the dataset distributed to the class:

  • personal information from section 1 of the survey

  • long-text and free-response answers, which contain some personal details

  • responses from students who did not consent to share

  • type distinction between research experiences

Single-variable summaries

What level of comfort do students interested in data analysis at UCSB have with mathematics?

Comfort ratings:

variable    max   mean       median   min
math.comf     5   3.847458        4     2
prog.comf     5   3.966102        4     3
stat.comf     5   4.084746        4     2

Proficiency ratings:

variable   mean       median
math       2.355932        2
prog       2.237288        2
stat       2.576271        3

Counts by proficiency level:

level   prog   math   stat
Beg        3      3      2
Int       39     32     21
Adv       17     24     36
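The proficiency means and medians can be recovered from the level counts alone. The sketch below assumes the levels are coded Beg = 1, Int = 2, Adv = 3; that coding is not stated in the slides, but it reproduces the reported summaries.

```python
import statistics

# Counts by proficiency level, copied from the last table above.
# The numeric coding of the levels is an assumption of this sketch.
level_codes = {"Beg": 1, "Int": 2, "Adv": 3}
counts = {
    "prog": {"Beg": 3, "Int": 39, "Adv": 17},
    "math": {"Beg": 3, "Int": 32, "Adv": 24},
    "stat": {"Beg": 2, "Int": 21, "Adv": 36},
}

summaries = {}
for variable, by_level in counts.items():
    # Expand the counts back into one coded rating per respondent.
    ratings = [level_codes[lvl] for lvl, k in by_level.items() for _ in range(k)]
    summaries[variable] = (statistics.mean(ratings), statistics.median(ratings))

for variable, (mean, median) in summaries.items():
    print(f"{variable}: mean = {mean:.6f}, median = {median}")
```

For example, the programming mean is (3·1 + 39·2 + 17·3)/59 = 132/59 ≈ 2.237, which matches the table.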

Multivariate summaries

Are students who ranked themselves as strong in statistics, mathematics, and computing more likely or less likely to select an ‘industry’ project as the project type that they want to work on?

mean.proficiency.fac   both   ind   lab
[1,2.33]                  6    25     1
(2.33,2.67]               5     9     1
(2.67,3]                  3     6     1

mean.proficiency.fac   both    ind     lab      n
[1,2.33]               0.188   0.781   0.031   32
(2.33,2.67]            0.333   0.600   0.067   15
(2.67,3]               0.300   0.600   0.100   10
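The second table is the first one with each row divided by its total. A minimal sketch of that step, using the counts from the first table (the variable names are illustrative):

```python
# Counts copied from the first table above: rows are bins of mean
# self-rated proficiency, columns are preferred project type.
counts = {
    "[1,2.33]":    {"both": 6, "ind": 25, "lab": 1},
    "(2.33,2.67]": {"both": 5, "ind": 9,  "lab": 1},
    "(2.67,3]":    {"both": 3, "ind": 6,  "lab": 1},
}

row_props = {}
for interval, row in counts.items():
    n = sum(row.values())  # number of students in this proficiency bin
    row_props[interval] = {kind: k / n for kind, k in row.items()}
    shares = ", ".join(f"{kind} = {p:.3f}" for kind, p in row_props[interval].items())
    print(f"{interval:<12} n = {n:<3} {shares}")
```

Row proportions make the bins comparable even though they differ in size (32, 15, and 10 students). With only 10 to 15 students in the upper bins, a difference of a few students moves these proportions noticeably.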

Combinations

Consider the distinct combinations of comfort and proficiency ratings (separately):

Proficiency ratings:

prog   math   stat    n
   1      1      1    1
   1      2      2    1
   1      2      3    1
   2      1      1    1
   2      1      2    1
   2      2      2   13
   2      2      3   11
   2      3      2    3
   2      3      3   10
   3      2      2    2
   3      2      3    4
   3      3      2    1
   3      3      3   10

Comfort ratings:

prog   math   stat    n
   3      2      2    1
   3      3      3    2
   3      3      4    2
   3      3      5    2
   3      4      3    4
   3      4      4    7
   4      3      3    2
   4      3      4    5
   4      3      5    2
   4      4      3    1
   4      4      4    5
   4      4      5    3
   4      5      3    1
   4      5      4    4
   4      5      5    2
   5      3      4    5
   5      3      5    1
   5      4      4    2
   5      4      5    1
   5      5      4    1
   5      5      5    6

Clustering

Can students be grouped based on combinations of proficiencies and comfort levels?

prog.prof   math.prof   stat.prof   prog.comf   math.comf   stat.comf   size   cluster
    2.478       2.739       2.957       4.435       4.522       4.522     23         1
    2.048       1.857       2.238       4.048       3.048       4.048     21         2
    2.133       2.467       2.467       3.133       3.933       3.467     15         3

The clustering method, k-means, assigns each observation to the nearest of \(k\) centers in Euclidean distance. The number of clusters \(k\) is user-specified; the method searches for the centers that minimize the total within-cluster variance.

Based on the centers:

  • Cluster 1: advanced proficiency, very comfortable

  • Cluster 2: intermediate with less mathematical preparation

  • Cluster 3: intermediate with less programming preparation
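To make the k-means description concrete, here is a minimal sketch of Lloyd's algorithm, one common way to fit k-means. It is not necessarily the algorithm or software behind the table above. The function names, the toy ratings, and the choice of starting centers are all hypothetical; practical implementations usually start from random centers and keep the best of several restarts, because the procedure only finds a local minimum.

```python
import math


def nearest(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, init_centers, n_iter=100):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Update step: move the center to the mean of its members.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # empty cluster: leave its center in place
        if new_centers == centers:
            break  # no center moved, so the assignments are stable
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical (proficiency, comfort) ratings for six students; not the survey data.
points = [(1, 2), (1, 3), (2, 2), (3, 4), (3, 5), (2, 5)]
centers, labels = kmeans(points, init_centers=[points[0], points[-1]])
print(centers)
print(labels)
```

On this toy input the first three students end up in one cluster and the last three in the other, and each returned center is the mean rating of its cluster, which is how the rows of the table of centers above should be read.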

Assignment

Your task is to extend this analysis with your group by next Tuesday.

Here are some ideas:

  • explore variable associations further (e.g., coursework and self-evaluations)

  • experiment with clustering on different variable subsets or using different methods

  • summarize domain or area of interest variables (requires some text manipulation)

Next time

Most of next meeting we’ll devote to planning your group’s task.

  • Do a little brainstorming on your own

  • Come with a few questions/ideas