Iteration strategies

Background

In class we discussed multiple testing in the context of an application that involved performing 1,317 \(t\)-tests. Implementing these tests involves iteration: repeatedly performing the same computations.

Objective. Here we’ll look at a few strategies for iteration in R:

  • loops

  • functions in the apply family

  • functional programming using tidyverse

We’ll illustrate these strategies using the biomarker data and reproduce some of the results shown in class.

Action

Create a new script for lab 3 in your labs project/folder and copy-paste the code chunk below at the top of the script.

library(tidyverse)
# install.packages('infer') # execute once then comment out

# data location
url <- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/labs/lab3-iteration/data/biomarker-clean.csv'

# function for outlier trimming
trim_fn <- function(x){
  x[x > 3] <- 3
  x[x < -3] <- -3
  
  return(x)
}

# read in and preprocess data
asd <- read_csv(url) %>%
  select(-ados) %>%
  # log transform
  mutate(across(.cols = -group, log10)) %>%
  # center and scale
  mutate(across(.cols = -group, ~ scale(.x)[, 1])) %>%
  # trim outliers
  mutate(across(.cols = -group, trim_fn))

Loops

Simple examples

A loop is a set of instructions to be repeated a specified number of times while incrementing a flag or index value. For example:

for(i in 1:4){
  print(2*i)
}
[1] 2
[1] 4
[1] 6
[1] 8

Here the instructions are:

  1. initialize index/flag i at i = 1
  2. execute code within the braces {...}
  3. increment i <- i + 1
  4. repeat steps 2-3 until i = 5

We could make the loop a bit more verbose:

flag_vals <- c(1, 2, 3, 4)
for(i in flag_vals){
  out <- 2*i
  print(out)
}
[1] 2
[1] 4
[1] 6
[1] 8

Now to retain the results in memory, a storage data structure must be defined and the output of each iteration assigned to some element(s) of the storage object.

rslt <- rep(NA, 4)
for(i in 1:4){
  rslt[i] <- 2*i
}
rslt
[1] 2 4 6 8

If we want to perform the same calculation for all values in a vector, we might do something like this:

rslt <- rep(NA, 4)
input_vals <- c(15, 27, 3, 12.6)
for(i in 1:4){
  rslt[i] <- 2*input_vals[i]
}
rslt
[1] 30.0 54.0  6.0 25.2
Check your understanding

Why does the following loop produce an NA ?

rslt <- rep(NA, 4)
input_vals <- rnorm(n = 3)
for(i in 1:4){
  rslt[i] <- 2*input_vals[i]
}

rslt
[1] -3.3664950 -0.5316422  1.2368258         NA

Loops are substantially similar in any programming language but usually not optimized for performance. Additionally, they are somewhat verbose and hard to read due to explicit use of indexing in the syntax.

Multiple testing with loops

In base R, the \(t\)-test is performed using t.test(...) , which takes as arguments two vectors of observations (one for each group). For instance:

x <- asd %>% filter(group == 'ASD') %>% pull(CHIP)
y <- asd %>% filter(group == 'TD') %>% pull(CHIP)
t.test(x, y, var.equal = F)

    Welch Two Sample t-test

data:  x and y
t = -0.18812, df = 151.74, p-value = 0.851
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2588194  0.2138173
sample estimates:
  mean of x   mean of y 
-0.04922954 -0.02672849 

The output is a list:

t.test(x, y) %>% str()
List of 10
 $ statistic  : Named num -0.188
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 152
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.851
 $ conf.int   : num [1:2] -0.259 0.214
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] -0.0492 -0.0267
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.12
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "x and y"
 - attr(*, "class")= chr "htest"

So if we want the p-value:

t.test(x, y, var.equal = F)$p.value
[1] 0.8510352

To calculate \(p\)-values for all tests using a loop, we wrap the code we used to perform one \(t\)-test in a for loop and add appropriate indexing. For speed, we’ll just compute the first 100 tests:

n_tests <- 100
p_vals <- rep(NA, n_tests)
for(i in 1:n_tests){
  x <- asd %>% filter(group == 'ASD') %>% pull(i + 1)
  y <- asd %>% filter(group == 'TD') %>% pull(i + 1)
  p_vals[i] <- t.test(x, y, var.equal = F)$p.value
}

To line these up with the proteins they correspond to, it’s necessary to keep track of the indexing carefully. In this case, the indexing corresponds to the order of columns. So we could create a data frame like so:

tibble(protein = colnames(asd)[2:(n_tests + 1)],
       p = p_vals)
# A tibble: 100 × 2
   protein        p
   <chr>      <dbl>
 1 CHIP     0.851  
 2 CEBPB    0.0322 
 3 NSE      0.350  
 4 PIAS4    0.104  
 5 IL-10 Ra 0.0232 
 6 STAT3    0.00183
 7 IRF1     0.592  
 8 c-Jun    0.0351 
 9 Mcl-1    0.999  
10 OAS1     0.942  
# … with 90 more rows

Alternatively, we could have set up the loop to output this result:

n_tests <- 100
rslt <- tibble(protein = colnames(asd)[2:(n_tests + 1)],
               p = NA)
for(i in 1:n_tests){
  x <- asd %>% filter(group == 'ASD') %>% pull(i + 1)
  y <- asd %>% filter(group == 'TD') %>% pull(i + 1)
  rslt$p[i] <- t.test(x, y, var.equal = F)$p.value
}
Action

Follow the example above to write a loop that stores both the \(p\)-values and the estimated differences for the first 50 proteins.

Apply family

Simple examples

In R, the apply family of functions allows one to efficiently iterate a function over an index set. So, to execute our simple for loop using apply , we could do something like this:

vals <- rnorm(n = 4)
simple_fn <- function(x){2*x}
lapply(vals, simple_fn)
[[1]]
[1] 1.062206

[[2]]
[1] 1.619845

[[3]]
[1] 5.018638

[[4]]
[1] 0.800267

This applies simple_fn to each element of vals , and returns the result as a list. If we want a neater output format, we could use sapply , which is short for sort-apply:

sapply(vals, simple_fn)
[1] 1.062206 1.619845 5.018638 0.800267

In more complex settings it often makes sense to apply a function across an index set. This is very similar conceptually to a for loop, but faster and easier to read.

# apply a function to an index set
simple_fn_ix <- function(i){2*vals[i]}
rslt_apply <- sapply(1:length(vals), simple_fn_ix)

# equivalent for loop
rslt_loop <- rep(NA, length(vals))
for(i in 1:length(vals)){
  rslt_loop[i] <- 2*vals[i]
}

# compare
rbind(rslt_loop, rslt_apply)
               [,1]     [,2]     [,3]     [,4]
rslt_loop  1.062206 1.619845 5.018638 0.800267
rslt_apply 1.062206 1.619845 5.018638 0.800267

\(t\)-tests using apply

We can use apply functions to compute \(t\)-tests for the proteins in the ASD data by coercing the data to a list of data frames that contain the grouping and level for each protein.

# number of tests to perform
n_tests <- 100

# convert to a list
asd_list <- asd %>% 
  select(1:(n_tests + 1)) %>%
  pivot_longer(cols = -group,
               names_to = 'protein',
               values_to = 'level') %>%
  group_by(protein) %>%
  group_split()

# first entry in list
asd_list[[1]]
# A tibble: 154 × 3
   group protein                     level
   <chr> <chr>                       <dbl>
 1 ASD   14-3-3 protein beta/alpha -0.124 
 2 ASD   14-3-3 protein beta/alpha  0.487 
 3 ASD   14-3-3 protein beta/alpha -0.801 
 4 ASD   14-3-3 protein beta/alpha  2.73  
 5 ASD   14-3-3 protein beta/alpha  1.24  
 6 ASD   14-3-3 protein beta/alpha  0.250 
 7 ASD   14-3-3 protein beta/alpha  0.932 
 8 ASD   14-3-3 protein beta/alpha  0.0873
 9 ASD   14-3-3 protein beta/alpha  0.213 
10 ASD   14-3-3 protein beta/alpha  0.157 
# … with 144 more rows

The function t.test(...) can also perform the test using a formula of the form y ~ x and a data frame containing x and y, as below.

t.test(level ~ group, data = asd_list[[1]])

    Welch Two Sample t-test

data:  level by group
t = -1.5671, df = 150.2, p-value = 0.1192
alternative hypothesis: true difference in means between group ASD and group TD is not equal to 0
95 percent confidence interval:
 -0.54341111  0.06269287
sample estimates:
mean in group ASD  mean in group TD 
       -0.1341683         0.1061909 

If we just want the \(p\)-value again, we can wrap this code in a function whose argument is the index \(i\). This function will return the \(p\)-value for the \(i\)th protein.

# p value for ith protein
tt_fn <- function(i){
  t.test(level ~ group, data = asd_list[[i]])$p.value
}

# check
tt_fn(1)
[1] 0.1191888

Now to perform many tests, we can simply iterate this function over consecutive index values 1:n_tests:

sapply(1:n_tests, tt_fn)
  [1] 1.191888e-01 2.972829e-01 8.297144e-01 3.034583e-02 8.136635e-01
  [6] 9.517828e-01 3.553359e-01 2.671529e-03 6.458878e-01 3.915314e-04
 [11] 1.759142e-01 3.505666e-02 5.095419e-01 4.788545e-07 2.862286e-01
 [16] 5.714296e-01 7.052780e-03 3.220583e-02 8.510352e-01 1.267133e-03
 [21] 9.482110e-03 1.293157e-04 7.804081e-03 2.208460e-04 1.407044e-01
 [26] 1.023033e-01 8.995855e-02 1.578665e-02 3.113212e-04 2.920587e-02
 [31] 4.663516e-07 2.395764e-01 5.709433e-03 8.962287e-02 2.700053e-02
 [36] 5.357313e-01 7.392658e-01 8.665332e-01 3.538260e-02 1.956257e-04
 [41] 8.658766e-01 2.111378e-04 3.286108e-01 5.374305e-01 1.108687e-01
 [46] 1.711403e-01 4.808293e-01 6.775472e-01 3.556564e-03 2.322968e-02
 [51] 3.746226e-01 7.804488e-01 3.175372e-01 4.085249e-01 4.746117e-03
 [56] 5.917788e-01 3.748021e-02 6.433125e-01 1.121721e-03 3.234610e-03
 [61] 4.154758e-03 1.669733e-03 1.726578e-03 9.990017e-01 7.100178e-04
 [66] 4.811425e-01 8.978465e-01 3.503310e-01 9.423978e-01 3.925728e-01
 [71] 3.025965e-01 4.511875e-02 4.219360e-01 2.117196e-01 1.036412e-01
 [76] 5.590746e-01 8.148983e-01 1.399029e-02 5.096269e-04 9.145121e-02
 [81] 3.331394e-02 4.350959e-01 8.721647e-01 3.266951e-03 2.704495e-01
 [86] 4.929196e-01 1.010954e-03 3.144033e-01 7.933758e-04 1.929813e-03
 [91] 6.104791e-02 1.832399e-03 1.333513e-02 6.412884e-01 2.605232e-02
 [96] 4.130732e-01 2.579393e-01 4.096623e-01 2.925137e-01 5.866341e-01

You might have noticed this was much faster than the loop. We can time it:

start <- Sys.time()
rslt <- sapply(1:n_tests, tt_fn)
end <- Sys.time()

end - start
Time difference of 0.209661 secs

And compare with the for loop:

start <- Sys.time()
n_tests <- 100
rslt <- tibble(protein = colnames(asd)[2:(n_tests + 1)],
               p = NA)
for(i in 1:n_tests){
  x <- asd %>% filter(group == 'ASD') %>% pull(i + 1)
  y <- asd %>% filter(group == 'TD') %>% pull(i + 1)
  rslt$p[i] <- t.test(x, y, var.equal = F)$p.value
}
end <- Sys.time()

end - start
Time difference of 11.63856 secs

Another nice feature of sapply is its ability to sort and arrange multiple outputs. For example, if the function is adjusted to return both the \(p\)-value and the test statistic:

tt_fn <- function(i){
  test_rslt <- t.test(level ~ group, data = asd_list[[i]])
  out <- c(pval = test_rslt$p.value, 
           tstat = test_rslt$statistic)
  out
}

tt_fn(1)
      pval    tstat.t 
 0.1191888 -1.5671297 

Then sapply will return a matrix:

sapply(1:5, tt_fn) %>% t() %>% as_tibble()
# A tibble: 5 × 2
    pval tstat.t
   <dbl>   <dbl>
1 0.119   -1.57 
2 0.297   -1.05 
3 0.830   -0.215
4 0.0303  -2.19 
5 0.814    0.236
Action
  1. Use sapply to obtain the estimated differences and standard errors for the groupwise comparisons for the first 50 proteins.
  2. Arrange the result in a data frame with a column indicating the protein, a column indicating the estimated group difference, and a column indicating the standard error.

Tidyverse

A final strategy for iteration comes from functional programming tools in tidyverse . The basic idea is:

  • define a grouping structure using relevant variables (in this case, proteins)

  • collapse the data into separate data frames by group

  • apply a function to each data frame that produces test output given input data

Nesting

One thing that tibbles can do that data frames cannot is store list-columns: columns that are lists of arbitrary objects. This allows for the arrangement of a much more general collection of objects in tabular form.

An intuitive example is nested data: a list-column of data frames having the same columns. If we nest the ASD data by protein, we obtain a data frame that looks like this:

asd_nested <- asd %>%
  pivot_longer(-group, 
               names_to = 'protein', 
               values_to = 'level') %>%
  nest(data = c(level, group))

asd_nested %>% head(5)
# A tibble: 5 × 2
  protein  data              
  <chr>    <list>            
1 CHIP     <tibble [154 × 2]>
2 CEBPB    <tibble [154 × 2]>
3 NSE      <tibble [154 × 2]>
4 PIAS4    <tibble [154 × 2]>
5 IL-10 Ra <tibble [154 × 2]>

The data column consists of data frames, one per protein, containing the variables group and level :

asd_nested %>%
  slice(1L) %>%
  pull(data)
[[1]]
# A tibble: 154 × 2
     level group
     <dbl> <chr>
 1  0.335  ASD  
 2 -0.0715 ASD  
 3 -0.406  ASD  
 4 -0.102  ASD  
 5 -0.395  ASD  
 6 -0.126  ASD  
 7  0.486  ASD  
 8 -0.990  ASD  
 9 -0.108  ASD  
10  0.485  ASD  
# … with 144 more rows

The map() function

In an ordinary data frame one can define a new variable as a function of other variables. The same can be done with list-columns in a tibble: one can define a new variable as a function of the elements of a list stored in another column. To do this, one uses the map() function, which is essentially the tidyverse version of lapply() :

  • map(.x, .fn) means roughly “apply the function .fn to each element in .x

Here we can write a function that takes a data frame with protein level and group as input, and returns a t test as output; then computing each test is as simple as calling mutate :

tt_fn <- function(.df){
  t.test(level ~ group, data = .df)
}

rslt <- asd_nested %>%
  slice(1:10) %>%
  mutate(ttest.out = map(data, tt_fn))

rslt
# A tibble: 10 × 3
   protein  data               ttest.out
   <chr>    <list>             <list>   
 1 CHIP     <tibble [154 × 2]> <htest>  
 2 CEBPB    <tibble [154 × 2]> <htest>  
 3 NSE      <tibble [154 × 2]> <htest>  
 4 PIAS4    <tibble [154 × 2]> <htest>  
 5 IL-10 Ra <tibble [154 × 2]> <htest>  
 6 STAT3    <tibble [154 × 2]> <htest>  
 7 IRF1     <tibble [154 × 2]> <htest>  
 8 c-Jun    <tibble [154 × 2]> <htest>  
 9 Mcl-1    <tibble [154 × 2]> <htest>  
10 OAS1     <tibble [154 × 2]> <htest>  
rslt %>% slice(1L) %>% pull(ttest.out)
[[1]]

    Welch Two Sample t-test

data:  level by group
t = -0.18812, df = 151.74, p-value = 0.851
alternative hypothesis: true difference in means between group ASD and group TD is not equal to 0
95 percent confidence interval:
 -0.2588194  0.2138173
sample estimates:
mean in group ASD  mean in group TD 
      -0.04922954       -0.02672849 

While all the data we might want are there, the output is a little unwieldy. Luckily, the infer package contains a pipe-operator-friendly function infer::t_test that returns results in a tidy fashion.

asd_nested %>% 
  slice(1L) %>% 
  unnest(cols = data) %>% 
  infer::t_test(formula = level ~ group,
         order = c('ASD', 'TD'),
         alternative = 'two-sided',
         var.equal = F)
# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1    -0.188  152.   0.851 two.sided    -0.0225   -0.259    0.214

We can create a wrapper around this function suitable for our purposes and then apply it to the list-column in asd_nested :

# wrapper around infer::t_test
tt_fn <- function(.df){
  infer::t_test(.df, 
         formula = level ~ group,
         order = c('ASD', 'TD'),
         alternative = 'two-sided',
         var.equal = F)
}

# compute test results
tt_out <- asd_nested %>%
  slice(1:n_tests) %>%
  mutate(ttest = map(data, tt_fn))

# preview
tt_out %>% head(4)
# A tibble: 4 × 3
  protein data               ttest           
  <chr>   <list>             <list>          
1 CHIP    <tibble [154 × 2]> <tibble [1 × 7]>
2 CEBPB   <tibble [154 × 2]> <tibble [1 × 7]>
3 NSE     <tibble [154 × 2]> <tibble [1 × 7]>
4 PIAS4   <tibble [154 × 2]> <tibble [1 × 7]>

Notice that ttest is also a list-column comprised of separate data frames. This column can be un-nested to show the output of infer::t_test explicitly:

tt_out %>% 
  unnest(ttest) %>%
  head(4)
# A tibble: 4 × 9
  protein data     statistic  t_df p_value alternative estimate lower_ci
  <chr>   <list>       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
1 CHIP    <tibble>    -0.188  152.  0.851  two.sided    -0.0225  -0.259 
2 CEBPB   <tibble>     2.16   150.  0.0322 two.sided     0.317    0.0273
3 NSE     <tibble>     0.937  151.  0.350  two.sided     0.148   -0.164 
4 PIAS4   <tibble>     1.64   152.  0.104  two.sided     0.222   -0.0459
# … with 1 more variable: upper_ci <dbl>

This approach has a few advantages, namely, it is syntactically more readable than either of the other approaches and it works with the pipe operator, so could in theory be incorporated into a chain that performs additional calculations on, say, the results of the \(t\)-test. However, the drawback is that it is slow:

# time it
start <- Sys.time()
tt_out <- asd_nested %>%
  slice(1:n_tests) %>%
  mutate(ttest = map(data, tt_fn))
end <- Sys.time()

end - start
Time difference of 4.896297 secs

It’s not as slow as a for loop, but it’s much slower than apply functions. If speed is a concern or the number of iterations is especially large, apply would be a better choice.

Adjusting p-values

To adjust the \(p\)-values, we simply manipulate the p_value column:

# bonferroni correction
tt_out %>% 
  unnest(ttest) %>%
  mutate(p_adj = p_value*n_tests) %>%
  select(protein, p_value, p_adj) %>%
  arrange(p_adj) %>%
  head(4)
# A tibble: 4 × 3
  protein     p_value     p_adj
  <chr>         <dbl>     <dbl>
1 FSTL1   0.000000466 0.0000466
2 C1QR1   0.000000479 0.0000479
3 DSC2    0.000129    0.0129   
4 HIF-1a  0.000196    0.0196   
Action

Implement the Benjamini-Hochberg correction

  1. Sort the raw \(p\)-values using arrange()
  2. Add a rank column of consecutive integers (neat trick: try using row_number())
  3. Add a column p_adj containing \(\frac{m}{i} p_{(i)}\) where \(m\) is the number of tests, \(i\) is the rank of the \(p\)-value, and \(p_{(i)}\) is the \(i\)th smallest \(p\)-value
  4. Find the collection of proteins with significantly different serum levels between the ASD and TD groups while controlling the false discovery rate at 1%.

Develop working code to execute 50 tests. Then use it to compute all 1317 tests.

Checklist

  • You’ve computed \(p\)-values iteratively using a loop, sapply , and nest+map .

  • You have commented codes in your script for each action item.

  • You’ve obtained a list of the significant proteins with 1% FDR.

  • You’ve saved your work somewhere where you can access it later.