Attribute | ASD (n = 76) | TD (n = 78) |
---|---|---|
Age: mean (SD) years | 5.6 (1.7) | 5.7 (2.0) |
PSTAT197A/CMPSC190DD Fall 2022
UCSB
don’t forget to fill out attendance form for each class meeting
first group assignment due Friday 10/14 11:59pm PST
labs/labN-TITLE-USERNAME.R
section attendance is expected
Levels of proteins in plasma/serum are altered in autism spectrum disorder (ASD).
Goal: identify a panel of proteins useful as a blood biomarker for early detection of ASD.
a ‘panel’ is a handful of tests that help distinguish between conditions
so in other words, find proteins whose serum levels are predictive of ASD
Data from Hewitson et al. (2021)
Serum samples from 76 boys with ASD and 78 typically developing (TD) boys, 18 months-8 years of age
A total of 1,125 proteins were analyzed from each sample
1,317 measured, 192 failed quality control
(we don’t know which ones failed QC so will use all)
Attribute | ASD (n = 76) | TD (n = 78) |
---|---|---|
Age: mean (SD) years | 5.6 (1.7) | 5.7 (2.0) |
Attribute | ASD (n = 76) | TD (n = 78) |
---|---|---|
White/Caucasian | 33 (45.2%) | 40 (51.9%) |
Hispanic/Latino | 26 (35.6%) | 6 (7.8%) |
African American/Black | 3 (4.1%) | 14 (18.2%) |
Asian or Pacific Islander | 2 (2.6%) | 3 (3.9%) |
Multiple ethnicities or Other | 9 (12.3%) | 14 (18.2%) |
Not reported | 3 (4.1%) | 1 (1.2%) |
Attribute | ASD (n = 76) | TD (n = 78) |
---|---|---|
None | 38 (52.8%) | 58 (75.3%) |
ADHD | 2 (2.8%) | 1 (1.3%) |
Seasonal Allergies | 30 (41.7%) | 17 (22.4%) |
Asthma | 2 (2.8%) | 0 (0%) |
Celiac Disease | 1 (1.4%) | 0 (0%) |
GERD | 1 (1.4%) | 0 (0%) |
PTSD | 0 (0%) | 1 (1.3%) |
Sleep Apnea | 2 (2.8%) | 0 (0%) |
Not reported | 4 (5.6%) | 1 (1.3%) |
Attribute | ASD (n = 76) | TD (n = 78) |
---|---|---|
None | 69 (92%) | 75 (97.4%) |
Anti-depressant | 2 (2.7%) | 0 (0%) |
Anti-psychotic | 0 (0%) | 1 (1.3%) |
Sedative | 1 (1.3%) | 0 (0%) |
SSRI | 2 (2.27%) | 0 (0%) |
Stimulant | 1 (1.3%) | 1 (1.3%) |
Not reported | 1 (1.3%) | 1 (1.3%) |
# A tibble: 5 × 1,318
group CHIP CEBPB NSE PIAS4 `IL-10 Ra` STAT3 IRF1 `c-Jun` `Mcl-1`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ASD 0.335 0.520 -0.554 0.650 -0.358 0.305 -0.484 0.309 1.57
2 ASD -0.0715 1.01 3 1.28 -0.133 1.13 0.253 0.408 0.0643
3 ASD -0.406 -0.531 -0.0592 1.13 0.554 -0.334 0.287 -0.845 1.42
4 ASD -0.102 -0.251 1.47 0.0773 -0.705 0.893 2.61 -0.372 -0.467
5 ASD -0.395 -0.536 0.0410 -0.299 -0.830 0.899 1.01 -0.843 -1.15
# … with 1,308 more variables: OAS1 <dbl>, `c-Myc` <dbl>, SMAD3 <dbl>,
# SMAD2 <dbl>, `IL-23` <dbl>, PDGFRA <dbl>, `IL-12` <dbl>, STAT1 <dbl>,
# STAT6 <dbl>, LRRK2 <dbl>, Osteocalcin <dbl>, `IL-5` <dbl>, GPDA <dbl>,
# IgA <dbl>, LPPL <dbl>, HEMK2 <dbl>, PDXK <dbl>, TLR4 <dbl>, REG4 <dbl>,
# `HSP 27` <dbl>, `YKL-40` <dbl>, `Alpha enolase` <dbl>, `Apo L1` <dbl>,
# CD38 <dbl>, CD59 <dbl>, FABPL <dbl>, `GDF-11` <dbl>, BTC <dbl>,
# `HIF-1a` <dbl>, S100A6 <dbl>, SECTM1 <dbl>, RSPO3 <dbl>, PSP <dbl>, …
# A tibble: 2 × 2
group n
<chr> <int>
1 ASD 76
2 TD 78
Methodology
multiple testing
classification: logistic regression; random forests
variable selection: LASSO regularization
classification accuracy measures
Concepts
data partitioning for predictive modeling
model interpretability
high dimensional data \(n < p\)
Idea: test for a significant difference in serum levels between groups for a given protein, say protein \(i\).
Notation:
\(\mu^i_{ASD}\): mean serum level of protein \(i\) in the ASD group
\(\mu^i_{TD}\): mean serum level of protein \(i\) in the TD group
\(\delta_i\): difference in means \(\mu^i_{ASD} - \mu^i_{TD}\)
hats indicate sample estimates (e.g. \(\hat{\delta}_i\))
The \(t\)-test tests \(H_{0i}: \delta_i = 0\) against its negation \(\neg H_{0i}: \delta_i \neq 0\) using the rule
\[ \text{reject $H_{0i}$ if}\qquad \left|\frac{\hat{\delta}_i}{SE(\hat{\delta}_i)}\right| > t_\alpha \]
The \(p\)-value for a test is the probability of obtaining a sample at least as contrary to \(H_{0i}\) as the sample in hand, assuming \(H_{0i}\) is true.
By construction, \(p < \alpha\) just in case the test rejects with type I error controlled at \(\alpha\).
So a common heuristic is:
\[ \text{reject $H_{0i}$ if} \qquad p_i \leq \alpha \]
Here is R output for one test.
asd %>%
t_test(formula = CHIP ~ group,
order = c('ASD', 'TD'),
alternative = 'two-sided',
var.equal = F)
# A tibble: 1 × 7
statistic t_df p_value alternative estimate lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 0.927 75.7 0.357 two.sided 384. -441. 1210.
Questions:
A plausible approach for identifying a protein panel, then, is to select all those proteins for which the \(t\)-test indicates a significant difference.
1,317 tests
easy to compute
conceptually straightforward
How likely are mistakes?
Let \(H_i\) denote the \(i\)th null hypothesis and \(R_i\) denote the event that \(H_i\) is rejected.
\(H_i\) | \(\neg H_i\) | |
---|---|---|
\(R_i\) | \(V\) false rejections | \(S\) correct |
\(\neg R_i\) | \(T\) correct | \(W\) false non-rejections |
The multiple testing problem is that individual error rates compound over multiple tests.
Familywise error rate (FWER) is the probability of one or more type I errors: \(P(V \geq 1)\).
Suppose there are \(m\) true hypotheses \(\mathcal{H}: \{H_i: i \in C\}\).
If the tests are independent and exact then:
\[ \begin{aligned} P(V \geq 1) &= P\left[ \bigcup_{i \in C} R_i | \mathcal{H} \right] \\ &= 1 - \prod_{i \in C} \left( 1- P(R_i|H_i) \right) \\ &= 1 - (1 - \alpha)^m \end{aligned} \]
If individual tests are exactly controlled at \(\alpha = 0.05\) and independent, at least one error is nearly certain by 100 tests.
The simplest multiple testing correction is based on the Bonferroni inequality:
\[ P\left[ \bigcup_{i \in C} R_i | \mathcal{H} \right] \leq \sum_{i \in C} P(R_i|\mathcal{H}) \]
If the individual tests are controlled at level \(\alpha\), then \(FWER \leq m\alpha\).
So a simple solution is to test at level \(\alpha^* = \frac{\alpha}{m}\).
In other words, reject if \(p_i < \frac{\alpha}{m}\).
FWER control will limit false rejections, but at the cost of power; controlling the probability of one type I error is a conservative approach.
More common in modern applications are procedures to control false discovery rate: the expected proportion of rejections that are false.
\[ \text{FDR} = \mathbb{E}\left[\frac{\text{false rejections}}{\text{total rejections}}\right] \]
Conceptually, if say FDR is controlled at \(0.05\), then one would expect 5% of rejections to be false.
Benjamini and Hochberg (1995) conceived a procedure based on sorting \(p\)-values.
Supposing \(m\) independent tests are performed:
They proved that this controls FDR at \(\alpha\).
The Benjamini-Hochberg assumes tests are independent, which is obviously not true in most situations. (Why?)
Benjamini and Yekutieli (2001) modified the correction to hold without the independence assumption:
Above, \(H_m = \sum_{i = 1}^m \frac{1}{i}\) .
The easiest way to implement these corrections is to adjust the \(p\)-values with a multiplier:
trim_fn <- function(x){
x[x > 3] <- 3
x[x < -3] <- -3
return(x)
}
asd_clean <- asd %>%
select(-ados) %>%
# log transform
mutate(across(.cols = -group, log10)) %>%
# center and scale
mutate(across(.cols = -group, ~ scale(.x)[, 1])) %>%
# trim outliers (affects results??)
mutate(across(.cols = -group, trim_fn))
asd_nested <- asd_clean %>%
pivot_longer(-group,
names_to = 'protein',
values_to = 'level') %>%
nest(data = c(level, group))
asd_nested %>% head(4)
# A tibble: 4 × 2
protein data
<chr> <list>
1 CHIP <tibble [154 × 2]>
2 CEBPB <tibble [154 × 2]>
3 NSE <tibble [154 × 2]>
4 PIAS4 <tibble [154 × 2]>
# compute for several groups
test_fn <- function(.df){
t_test(.df,
formula = level ~ group,
order = c('ASD', 'TD'),
alternative = 'two-sided',
var.equal = F)
}
tt_out <- asd_nested %>%
mutate(ttest = map(data, test_fn)) %>%
unnest(ttest) %>%
arrange(p_value)
tt_out %>% head(5)
# A tibble: 5 × 9
protein data statistic t_df p_value alternative estimate lower_ci
<chr> <list> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 DERM <tibble> -6.10 151. 8.27e-9 two.sided -0.885 -1.17
2 RELT <tibble> -5.65 152. 7.82e-8 two.sided -0.775 -1.05
3 FSTL1 <tibble> -5.27 152. 4.66e-7 two.sided -0.783 -1.08
4 C1QR1 <tibble> -5.26 152. 4.79e-7 two.sided -0.782 -1.08
5 Calcineurin <tibble> -5.24 151. 5.37e-7 two.sided -0.734 -1.01
# … with 1 more variable: upper_ci <dbl>
# multiple testing corrections
m <- nrow(tt_out)
hm <- log(m) + 1/(2*m) - digamma(1)
tt_corrected <- tt_out %>%
select(data, protein, p_value) %>%
mutate(rank = row_number()) %>%
mutate(p_bh = p_value*m/rank,
p_by = p_value*m*hm/rank,
p_bonf = p_value*m)
tt_corrected %>% head(5)
# A tibble: 5 × 7
data protein p_value rank p_bh p_by p_bonf
<list> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 <tibble [154 × 2]> DERM 0.00000000827 1 0.0000109 0.0000845 1.09e-5
2 <tibble [154 × 2]> RELT 0.0000000782 2 0.0000515 0.000400 1.03e-4
3 <tibble [154 × 2]> FSTL1 0.000000466 3 0.000205 0.00159 6.14e-4
4 <tibble [154 × 2]> C1QR1 0.000000479 4 0.000158 0.00122 6.31e-4
5 <tibble [154 × 2]> Calcineurin 0.000000537 5 0.000141 0.00110 7.07e-4
# A tibble: 10 × 2
protein p_by
<chr> <dbl>
1 DERM 0.0000845
2 RELT 0.000400
3 Calcineurin 0.00110
4 C1QR1 0.00122
5 MRC2 0.00132
6 IgD 0.00136
7 CXCL16, soluble 0.00149
8 PTN 0.00154
9 FSTL1 0.00159
10 Cadherin-5 0.00179
Other approaches to the same problem:
correlation with ADOS (severity diagnostic score)
variable importance in random forest classifier