```
# A tibble: 4 × 2
# Groups: group [2]
group ados
<chr> <dbl>
1 ASD 20
2 ASD 10
3 TD NA
4 TD NA
```

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

Make your final commits for the first group assignment by *Friday 10/14 11:59pm PST*

- should include an updated
`report.qmd`

and`report.html`

with your write-up

Next group assignment to be distributed Tuesday 10/18.

Groups will be randomly assigned

Task: replicate and redesign proteomic analysis

introduced ASD proteomic dataset

used multiple testing corrections to identify proteins whose serum levels differ significantly between ASD/TD groups

discussed the difference between controlling familywise error vs. false discovery

We’ll talk about two more approaches to identifying proteins of interest:

correlation-based identification of proteins

random forest classifier

Autism Diagnostic Observation Schedule (ADOS) scores are determined by psychological assessment and measure ASD severity.

```
# A tibble: 4 × 2
# Groups: group [2]
group ados
<chr> <dbl>
1 ASD 20
2 ASD 10
3 TD NA
4 TD NA
```

`ados`

is only measured for the ASD groupnumerical score between 6 and 23 (at least in this sample)

So here’s an idea:

compute correlations of each protein with ADOS;

pick the 10 proteins with the strongest correlation

This is a simple aggregation operation and can be executed in R with `summarize()`

:

```
# compute correlations
asd_clean %>%
pivot_longer(cols = -ados,
names_to = 'protein',
values_to = 'level') %>%
group_by(protein) %>%
summarize(correlation = cor(ados, level))
```

```
# A tibble: 1,317 × 2
protein correlation
<chr> <dbl>
1 14-3-3 0.0970
2 14-3-3 protein beta/alpha 0.0481
3 14-3-3 protein theta 0.122
4 14-3-3 protein zeta/delta 0.0735
5 14-3-3E -0.127
6 17-beta-HSD 1 0.0969
7 3HAO 0.0773
8 3HIDH 0.0452
9 4-1BB 0.0290
10 4-1BB ligand 0.0799
# … with 1,307 more rows
```

```
# A tibble: 10 × 4
protein correlation abs.corr rank
<chr> <dbl> <dbl> <int>
1 CO8A1 0.362 0.362 1317
2 C5b, 6 Complex -0.337 0.337 1
3 Thrombospondin-1 0.310 0.310 1316
4 ILT-2 -0.309 0.309 2
5 TRAIL R4 0.296 0.296 1315
6 HCE000414 -0.296 0.296 3
7 C1r -0.290 0.290 4
8 GM-CSF 0.290 0.290 1314
9 Angiogenin -0.284 0.284 5
10 HCG 0.278 0.278 1313
```

*Fact:* the simple linear regression coefficient estimate is proportional to the correlation coefficient.

So it should give similar results to sort the SLR coefficients by significance.

```
# A tibble: 1,317 × 3
protein estimate p.value
<chr> <dbl> <dbl>
1 CO8A1 0.391 0.00131
2 C5b, 6 Complex -0.401 0.00290
3 Thrombospondin-1 0.317 0.00635
4 ILT-2 -0.546 0.00664
5 TRAIL R4 0.353 0.00933
6 HCE000414 -0.278 0.00950
7 C1r -0.341 0.0110
8 GM-CSF 0.345 0.0110
9 Angiogenin -0.314 0.0130
10 HCG 0.306 0.0149
# … with 1,307 more rows
```

If we do the correlation analysis this way, do the identified proteins pass multiple testing significance thresholds?

```
# A tibble: 10 × 3
protein p.value p.adj
<chr> <dbl> <dbl>
1 CO8A1 0.00131 0.0145
2 C5b, 6 Complex 0.00290 0.0342
3 Thrombospondin-1 0.00635 0.112
4 ILT-2 0.00664 0.0641
5 TRAIL R4 0.00933 1.29
6 HCE000414 0.00950 0.916
7 C1r 0.0110 0.242
8 GM-CSF 0.0110 0.124
9 Angiogenin 0.0130 0.142
10 HCG 0.0149 0.159
```

probably just introducing selection noise

but also, this result diverges considerably from the paper (?)

The SLR approach is equivalent to sorting the correlations. (Take a moment to check the results.)

We could use inference on the population correlation to obtain a \(p\)-value associated with each sample correlation coefficient. These match the ones from SLR.

```
cor_test <- function(x, y){
cor_out <- cor.test(x, y)
tibble(estimate = cor_out$estimate,
p.value = cor_out$p.value)
}
asd_clean %>%
pivot_longer(cols = -ados,
names_to = 'protein',
values_to = 'level') %>%
group_by(protein) %>%
summarize(correlation = cor_test(ados, level)) %>%
unnest(correlation) %>%
arrange(p.value)
```

```
# A tibble: 1,317 × 3
protein estimate p.value
<chr> <dbl> <dbl>
1 CO8A1 0.362 0.00131
2 C5b, 6 Complex -0.337 0.00290
3 Thrombospondin-1 0.310 0.00635
4 ILT-2 -0.309 0.00664
5 TRAIL R4 0.296 0.00933
6 HCE000414 -0.296 0.00950
7 C1r -0.290 0.0110
8 GM-CSF 0.290 0.0110
9 Angiogenin -0.284 0.0130
10 HCG 0.278 0.0149
# … with 1,307 more rows
```

A ** binary tree** is a directed graph in which:

there is at most one path between any two nodes

each node has at most two outward-directed edges

A ** classification tree** is a binary tree in which the paths represent classification rules.

Say we want to predict income based on captial gains and education level using census data.

```
# A tibble: 6 × 3
# Groups: income [2]
income educ capital_gain
<fct> <fct> <dbl>
1 <=50K hs 0
2 <=50K advanced 0
3 <=50K college 0
4 >50K hs 7688
5 >50K hs 7688
6 >50K college 5178
```

We could construct a classification tree by ‘splitting’ based on the values of predictor variables.

To get a sense of the process of tree construction, we’ll do an activity in groups: each group will build a tree ‘by hand’.

first let’s look at the instructions together

then take about 15-20 minutes to do it

A random forest is a classifier based on many trees. It is constructed by:

building some large number of \(T\) trees using bootstrap samples and random subsets of predictors (what you just did, repeated many times)

taking a majority vote across all trees to determine the classification

So let’s take a vote using your trees!

If the number of trees \(T\) is large (as it should be):

trees are built using lots of random subsets of predictors

can keep track of which ones are used most often to define splits

** Variable importance scores** provide a measure of how influential each predictor is in a random forest.

Back to the proteomics data, the variable importance scores from a random forest provide another means of identifying proteins.

```
# grow RF
set.seed(101222)
rf_out <- randomForest(x = asd_preds, y = asd_resp,
mtry = 100, ntree = 1000,
importance = T)
# variable importance
rf_out$importance %>%
as_tibble() %>%
mutate(protein = rownames(rf_out$importance)) %>%
slice_max(MeanDecreaseGini, n = 10) %>%
select(protein)
```

```
# A tibble: 10 × 1
protein
<chr>
1 DERM
2 IgD
3 TGF-b R III
4 TSP4
5 Notch 1
6 MAPK14
7 RELT
8 ERBB1
9 CK-MB
10 SOST
```

But how accurate is the predictor?

ASD | TD | class.error | |
---|---|---|---|

ASD | 53 | 23 | 0.3026316 |

TD | 19 | 59 | 0.2435897 |

logistic regression

variable selection

a design view of the proteomic analysis