library(tidyverse)
# install.packages('infer') # execute once then comment out
# data location
<- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/labs/lab3-iteration/data/biomarker-clean.csv'
url
# function for outlier trimming
<- function(x){
trim_fn > 3] <- 3
x[x < -3] <- -3
x[x
return(x)
}
# read in and preprocess data
<- read_csv(url) %>%
asd select(-ados) %>%
# log transform
mutate(across(.cols = -group, log10)) %>%
# center and scale
mutate(across(.cols = -group, ~ scale(.x)[, 1])) %>%
# trim outliers
mutate(across(.cols = -group, trim_fn))
Iteration strategies
Background
In class we discussed multiple testing in the context of an application that involved performing 1,317 \(t\)-tests. Implementing these tests involves iteration: repeatedly performing the same computations.
Objective. Here we’ll look at a few strategies for iteration in R:
loops
functions in the
apply
familyfunctional programming using
tidyverse
We’ll illustrate these strategies using the biomarker data and reproduce some of the results shown in class.
Create a new script for lab 3 in your labs project/folder and copy-paste the code chunk below at the top of the script.
Loops
Simple examples
A loop is a set of instructions to be repeated a specified number of times while incrementing a flag or index value. For example:
for(i in 1:4){
print(2*i)
}
[1] 2
[1] 4
[1] 6
[1] 8
Here the instructions are:
- initialize index/flag
i
ati = 1
- execute code within the braces
{...}
- increment
i <- i + 1
- repeat steps 2-3 until
i = 5
We could make the loop a bit more verbose:
<- c(1, 2, 3, 4)
flag_vals for(i in flag_vals){
<- 2*i
out print(out)
}
[1] 2
[1] 4
[1] 6
[1] 8
Now to retain the results in memory, a storage data structure must be defined and the output of each iteration assigned to some element(s) of the storage object.
<- rep(NA, 4)
rslt for(i in 1:4){
<- 2*i
rslt[i]
} rslt
[1] 2 4 6 8
If we want to perform the same calculation for all values in a vector, we might do something like this:
<- rep(NA, 4)
rslt <- c(15, 27, 3, 12.6)
input_vals for(i in 1:4){
<- 2*input_vals[i]
rslt[i]
} rslt
[1] 30.0 54.0 6.0 25.2
Why does the following loop produce an NA
?
<- rep(NA, 4)
rslt <- rnorm(n = 3)
input_vals for(i in 1:4){
<- 2*input_vals[i]
rslt[i]
}
rslt
[1] -3.3664950 -0.5316422 1.2368258 NA
Loops are substantially similar in any programming language but usually not optimized for performance. Additionally, they are somewhat verbose and hard to read due to explicit use of indexing in the syntax.
Multiple testing with loops
In base R, the \(t\)-test is performed using t.test(...)
, which takes as arguments two vectors of observations (one for each group). For instance:
<- asd %>% filter(group == 'ASD') %>% pull(CHIP)
x <- asd %>% filter(group == 'TD') %>% pull(CHIP)
y t.test(x, y, var.equal = F)
Welch Two Sample t-test
data: x and y
t = -0.18812, df = 151.74, p-value = 0.851
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2588194 0.2138173
sample estimates:
mean of x mean of y
-0.04922954 -0.02672849
The output is a list:
t.test(x, y) %>% str()
List of 10
$ statistic : Named num -0.188
..- attr(*, "names")= chr "t"
$ parameter : Named num 152
..- attr(*, "names")= chr "df"
$ p.value : num 0.851
$ conf.int : num [1:2] -0.259 0.214
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num [1:2] -0.0492 -0.0267
..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
$ null.value : Named num 0
..- attr(*, "names")= chr "difference in means"
$ stderr : num 0.12
$ alternative: chr "two.sided"
$ method : chr "Welch Two Sample t-test"
$ data.name : chr "x and y"
- attr(*, "class")= chr "htest"
So if we want the p-value:
t.test(x, y, var.equal = F)$p.value
[1] 0.8510352
To calculate \(p\)-values for all tests using a loop, we wrap the code we used to perform one \(t\)-test in a for
loop and add appropriate indexing. For speed, we’ll just compute the first 100 tests:
<- 100
n_tests <- rep(NA, n_tests)
p_vals for(i in 1:n_tests){
<- asd %>% filter(group == 'ASD') %>% pull(i + 1)
x <- asd %>% filter(group == 'TD') %>% pull(i + 1)
y <- t.test(x, y, var.equal = F)$p.value
p_vals[i] }
To line these up with the proteins they correspond to, it’s necessary to keep track of the indexing carefully. In this case, the indexing corresponds to the order of columns. So we could create a data frame like so:
tibble(protein = colnames(asd)[2:(n_tests + 1)],
p = p_vals)
# A tibble: 100 × 2
protein p
<chr> <dbl>
1 CHIP 0.851
2 CEBPB 0.0322
3 NSE 0.350
4 PIAS4 0.104
5 IL-10 Ra 0.0232
6 STAT3 0.00183
7 IRF1 0.592
8 c-Jun 0.0351
9 Mcl-1 0.999
10 OAS1 0.942
# … with 90 more rows
Alternatively, we could have set up the loop to output this result:
<- 100
n_tests <- tibble(protein = colnames(asd)[2:(n_tests + 1)],
rslt p = NA)
for(i in 1:n_tests){
<- asd %>% filter(group == 'ASD') %>% pull(i + 1)
x <- asd %>% filter(group == 'TD') %>% pull(i + 1)
y $p[i] <- t.test(x, y, var.equal = F)$p.value
rslt }
Follow the example above to write a loop that stores both the \(p\)-values and the estimated differences for the first 50 proteins.
Apply family
Simple examples
In R, the apply family of functions allows one to efficiently iterate a function over an index set. So, to execute our simple for loop using apply
, we could do something like this:
<- rnorm(n = 4)
vals <- function(x){2*x}
simple_fn lapply(vals, simple_fn)
[[1]]
[1] 1.062206
[[2]]
[1] 1.619845
[[3]]
[1] 5.018638
[[4]]
[1] 0.800267
This applies simple_fn
to each element of vals
, and returns the result as a list. If we want a neater output format, we could use sapply
, which is short for sort-apply:
sapply(vals, simple_fn)
[1] 1.062206 1.619845 5.018638 0.800267
In more complex settings it often makes sense to apply a function across an index set. This is very similar conceptually to a for loop, but faster and easier to read.
# apply a function to an index set
<- function(i){2*vals[i]}
simple_fn_ix <- sapply(1:length(vals), simple_fn_ix)
rslt_apply
# equivalent for loop
<- rep(NA, length(vals))
rslt_loop for(i in 1:length(vals)){
<- 2*vals[i]
rslt_loop[i]
}
# compare
rbind(rslt_loop, rslt_apply)
[,1] [,2] [,3] [,4]
rslt_loop 1.062206 1.619845 5.018638 0.800267
rslt_apply 1.062206 1.619845 5.018638 0.800267
\(t\)-tests using apply
We can use apply
functions to compute \(t\)-tests for the proteins in the ASD data by coercing the data to a list of data frames that contain the grouping and level for each protein.
# number of tests to perform
<- 100
n_tests
# convert to a list
<- asd %>%
asd_list select(1:(n_tests + 1)) %>%
pivot_longer(cols = -group,
names_to = 'protein',
values_to = 'level') %>%
group_by(protein) %>%
group_split()
# first entry in list
1]] asd_list[[
# A tibble: 154 × 3
group protein level
<chr> <chr> <dbl>
1 ASD 14-3-3 protein beta/alpha -0.124
2 ASD 14-3-3 protein beta/alpha 0.487
3 ASD 14-3-3 protein beta/alpha -0.801
4 ASD 14-3-3 protein beta/alpha 2.73
5 ASD 14-3-3 protein beta/alpha 1.24
6 ASD 14-3-3 protein beta/alpha 0.250
7 ASD 14-3-3 protein beta/alpha 0.932
8 ASD 14-3-3 protein beta/alpha 0.0873
9 ASD 14-3-3 protein beta/alpha 0.213
10 ASD 14-3-3 protein beta/alpha 0.157
# … with 144 more rows
The function t.test(...)
can also perform the test using a formula of the form y ~ x
and a data frame containing x
and y
, as below.
t.test(level ~ group, data = asd_list[[1]])
Welch Two Sample t-test
data: level by group
t = -1.5671, df = 150.2, p-value = 0.1192
alternative hypothesis: true difference in means between group ASD and group TD is not equal to 0
95 percent confidence interval:
-0.54341111 0.06269287
sample estimates:
mean in group ASD mean in group TD
-0.1341683 0.1061909
If we just want the \(p\)-value again, we can wrap this code in a function whose argument is the index \(i\). This function will return the \(p\)-value for the \(i\)th protein.
# p value for ith protein
<- function(i){
tt_fn t.test(level ~ group, data = asd_list[[i]])$p.value
}
# check
tt_fn(1)
[1] 0.1191888
Now to perform many tests, we can simply iterate this function over consecutive index values 1:n_tests
:
sapply(1:n_tests, tt_fn)
[1] 1.191888e-01 2.972829e-01 8.297144e-01 3.034583e-02 8.136635e-01
[6] 9.517828e-01 3.553359e-01 2.671529e-03 6.458878e-01 3.915314e-04
[11] 1.759142e-01 3.505666e-02 5.095419e-01 4.788545e-07 2.862286e-01
[16] 5.714296e-01 7.052780e-03 3.220583e-02 8.510352e-01 1.267133e-03
[21] 9.482110e-03 1.293157e-04 7.804081e-03 2.208460e-04 1.407044e-01
[26] 1.023033e-01 8.995855e-02 1.578665e-02 3.113212e-04 2.920587e-02
[31] 4.663516e-07 2.395764e-01 5.709433e-03 8.962287e-02 2.700053e-02
[36] 5.357313e-01 7.392658e-01 8.665332e-01 3.538260e-02 1.956257e-04
[41] 8.658766e-01 2.111378e-04 3.286108e-01 5.374305e-01 1.108687e-01
[46] 1.711403e-01 4.808293e-01 6.775472e-01 3.556564e-03 2.322968e-02
[51] 3.746226e-01 7.804488e-01 3.175372e-01 4.085249e-01 4.746117e-03
[56] 5.917788e-01 3.748021e-02 6.433125e-01 1.121721e-03 3.234610e-03
[61] 4.154758e-03 1.669733e-03 1.726578e-03 9.990017e-01 7.100178e-04
[66] 4.811425e-01 8.978465e-01 3.503310e-01 9.423978e-01 3.925728e-01
[71] 3.025965e-01 4.511875e-02 4.219360e-01 2.117196e-01 1.036412e-01
[76] 5.590746e-01 8.148983e-01 1.399029e-02 5.096269e-04 9.145121e-02
[81] 3.331394e-02 4.350959e-01 8.721647e-01 3.266951e-03 2.704495e-01
[86] 4.929196e-01 1.010954e-03 3.144033e-01 7.933758e-04 1.929813e-03
[91] 6.104791e-02 1.832399e-03 1.333513e-02 6.412884e-01 2.605232e-02
[96] 4.130732e-01 2.579393e-01 4.096623e-01 2.925137e-01 5.866341e-01
You might have noticed this was much faster than the loop. We can time it:
<- Sys.time()
start <- sapply(1:n_tests, tt_fn)
rslt <- Sys.time()
end
- start end
Time difference of 0.209661 secs
And compare with the for loop:
<- Sys.time()
start <- 100
n_tests <- tibble(protein = colnames(asd)[2:(n_tests + 1)],
rslt p = NA)
for(i in 1:n_tests){
<- asd %>% filter(group == 'ASD') %>% pull(i + 1)
x <- asd %>% filter(group == 'TD') %>% pull(i + 1)
y $p[i] <- t.test(x, y, var.equal = F)$p.value
rslt
}<- Sys.time()
end
- start end
Time difference of 11.63856 secs
Another nice feature of sapply
is its ability to sort and arrange multiple outputs. For example, if the function is adjusted to return both the \(p\)-value and the test statistic:
<- function(i){
tt_fn <- t.test(level ~ group, data = asd_list[[i]])
test_rslt <- c(pval = test_rslt$p.value,
out tstat = test_rslt$statistic)
out
}
tt_fn(1)
pval tstat.t
0.1191888 -1.5671297
Then sapply
will return a matrix:
sapply(1:5, tt_fn) %>% t() %>% as_tibble()
# A tibble: 5 × 2
pval tstat.t
<dbl> <dbl>
1 0.119 -1.57
2 0.297 -1.05
3 0.830 -0.215
4 0.0303 -2.19
5 0.814 0.236
- Use
sapply
to obtain the estimated differences and standard errors for the groupwise comparisons for the first 50 proteins. - Arrange the result in a data frame with a column indicating the protein, a column indicating the estimated group difference, and a column indicating the standard error.
Tidyverse
A final strategy for iteration comes from functional programming tools in tidyverse
. The basic idea is:
define a grouping structure using relevant variables (in this case, proteins)
collapse the data into separate data frames by group
apply a function to each data frame that produces test output given input data
Nesting
One thing that tibbles can do that data frames cannot is store list-columns: columns that are lists of arbitrary objects. This allows for the arrangement of a much more general collection of objects in tabular form.
An intuitive example is nested data: a list-column of data frames having the same columns. If we nest the ASD data by protein, we obtain a data frame that looks like this:
<- asd %>%
asd_nested pivot_longer(-group,
names_to = 'protein',
values_to = 'level') %>%
nest(data = c(level, group))
%>% head(5) asd_nested
# A tibble: 5 × 2
protein data
<chr> <list>
1 CHIP <tibble [154 × 2]>
2 CEBPB <tibble [154 × 2]>
3 NSE <tibble [154 × 2]>
4 PIAS4 <tibble [154 × 2]>
5 IL-10 Ra <tibble [154 × 2]>
The data
column consists of data frames, one per protein, containing the variables group
and level
:
%>%
asd_nested slice(1L) %>%
pull(data)
[[1]]
# A tibble: 154 × 2
level group
<dbl> <chr>
1 0.335 ASD
2 -0.0715 ASD
3 -0.406 ASD
4 -0.102 ASD
5 -0.395 ASD
6 -0.126 ASD
7 0.486 ASD
8 -0.990 ASD
9 -0.108 ASD
10 0.485 ASD
# … with 144 more rows
The map()
function
In an ordinary data frame one can define a new variable as a function of other variables. The same can be done with list-columns in a tibble: one can define a new variable as a function of the elements of a list stored in another column. To do this, one uses the map()
function, which is essentially the tidyverse version of lapply()
:
map(.x, .fn)
means roughly “apply the function.fn
to each element in.x
”
Here we can write a function that takes a data frame with protein level and group as input, and returns a t test as output; then computing each test is as simple as calling mutate
:
<- function(.df){
tt_fn t.test(level ~ group, data = .df)
}
<- asd_nested %>%
rslt slice(1:10) %>%
mutate(ttest.out = map(data, tt_fn))
rslt
# A tibble: 10 × 3
protein data ttest.out
<chr> <list> <list>
1 CHIP <tibble [154 × 2]> <htest>
2 CEBPB <tibble [154 × 2]> <htest>
3 NSE <tibble [154 × 2]> <htest>
4 PIAS4 <tibble [154 × 2]> <htest>
5 IL-10 Ra <tibble [154 × 2]> <htest>
6 STAT3 <tibble [154 × 2]> <htest>
7 IRF1 <tibble [154 × 2]> <htest>
8 c-Jun <tibble [154 × 2]> <htest>
9 Mcl-1 <tibble [154 × 2]> <htest>
10 OAS1 <tibble [154 × 2]> <htest>
%>% slice(1L) %>% pull(ttest.out) rslt
[[1]]
Welch Two Sample t-test
data: level by group
t = -0.18812, df = 151.74, p-value = 0.851
alternative hypothesis: true difference in means between group ASD and group TD is not equal to 0
95 percent confidence interval:
-0.2588194 0.2138173
sample estimates:
mean in group ASD mean in group TD
-0.04922954 -0.02672849
While all the data we might want are there, the output is a little unwieldy. Luckily, the infer
package contains a pipe-operator-friendly function infer::t_test
that returns results in a tidy fashion.
%>%
asd_nested slice(1L) %>%
unnest(cols = data) %>%
::t_test(formula = level ~ group,
inferorder = c('ASD', 'TD'),
alternative = 'two-sided',
var.equal = F)
# A tibble: 1 × 7
statistic t_df p_value alternative estimate lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 -0.188 152. 0.851 two.sided -0.0225 -0.259 0.214
We can create a wrapper around this function suitable for our purposes and then apply it to the list-column in asd_nested
:
# wrapper around infer::t_test
<- function(.df){
tt_fn ::t_test(.df,
inferformula = level ~ group,
order = c('ASD', 'TD'),
alternative = 'two-sided',
var.equal = F)
}
# compute test results
<- asd_nested %>%
tt_out slice(1:n_tests) %>%
mutate(ttest = map(data, tt_fn))
# preview
%>% head(4) tt_out
# A tibble: 4 × 3
protein data ttest
<chr> <list> <list>
1 CHIP <tibble [154 × 2]> <tibble [1 × 7]>
2 CEBPB <tibble [154 × 2]> <tibble [1 × 7]>
3 NSE <tibble [154 × 2]> <tibble [1 × 7]>
4 PIAS4 <tibble [154 × 2]> <tibble [1 × 7]>
Notice that ttest
is also a list-column comprised of separate data frames. This column can be un-nested to show the output of infer::t_test
explicitly:
%>%
tt_out unnest(ttest) %>%
head(4)
# A tibble: 4 × 9
protein data statistic t_df p_value alternative estimate lower_ci
<chr> <list> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 CHIP <tibble> -0.188 152. 0.851 two.sided -0.0225 -0.259
2 CEBPB <tibble> 2.16 150. 0.0322 two.sided 0.317 0.0273
3 NSE <tibble> 0.937 151. 0.350 two.sided 0.148 -0.164
4 PIAS4 <tibble> 1.64 152. 0.104 two.sided 0.222 -0.0459
# … with 1 more variable: upper_ci <dbl>
This approach has a few advantages, namely, it is syntactically more readable than either of the other approaches and it works with the pipe operator, so could in theory be incorporated into a chain that performs additional calculations on, say, the results of the \(t\)-test. However, the drawback is that it is slow:
# time it
<- Sys.time()
start <- asd_nested %>%
tt_out slice(1:n_tests) %>%
mutate(ttest = map(data, tt_fn))
<- Sys.time()
end
- start end
Time difference of 4.896297 secs
It’s not as slow as a for loop, but it’s much slower than apply
functions. If speed is a concern or the number of iterations is especially large, apply
would be a better choice.
Adjusting p-values
To adjust the \(p\)-values, we simply manipulate the p_value
column:
# bonferroni correction
%>%
tt_out unnest(ttest) %>%
mutate(p_adj = p_value*n_tests) %>%
select(protein, p_value, p_adj) %>%
arrange(p_adj) %>%
head(4)
# A tibble: 4 × 3
protein p_value p_adj
<chr> <dbl> <dbl>
1 FSTL1 0.000000466 0.0000466
2 C1QR1 0.000000479 0.0000479
3 DSC2 0.000129 0.0129
4 HIF-1a 0.000196 0.0196
Implement the Benjamini-Hochberg correction
- Sort the raw \(p\)-values using
arrange()
- Add a
rank
column of consecutive integers (neat trick: try usingrow_number()
) - Add a column
p_adj
containing \(\frac{m}{i} p_{(i)}\) where \(m\) is the number of tests, \(i\) is the rank of the \(p\)-value, and \(p_{(i)}\) is the \(i\)th smallest \(p\)-value - Find the collection of proteins with significantly different serum levels between the ASD and TD groups while controlling the false discovery rate at 1%.
Develop working code to execute 50 tests. Then use it to compute all 1317 tests.
Checklist
You’ve computed \(p\)-values iteratively using a loop,
sapply
, andnest
+map
.You have commented codes in your script for each action item.
You’ve obtained a list of the significant proteins with 1% FDR.
You’ve saved your work somewhere where you can access it later.