PSTAT197A/CMPSC190DD Fall 2022
UCSB
Office hours follow all class and section meetings in building 434 room 122.
With your table, share your diagrams.
identify common elements/aspects
pick one or a blend that you like best and sketch it on the whiteboard
How would you score the proteomics analysis along the following dimensions?
reproducibility: how easy is it to reproduce the results exactly?
transparency: how widely distributed is potential variability in results across the elements of the data analysis?
replicability: how likely is it that if the study were conducted with new subjects, the same panel would be obtained?
At a very high level, the analysis we’ve covered is a ‘two-step’ procedure: first select a panel of proteins, then estimate a classification model using only the selected panel.
A likely reason for this approach is that one cannot fit a regression model with 1317 predictors based on only 154 data points (\(p > n\)).
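For a concrete (toy) illustration of the problem, here is a small sketch, not from the class analysis, of what happens when ordinary logistic regression is fit with more predictors than observations; the dimensions here are made up:

# toy data with more predictors than observations (hypothetical dimensions)
set.seed(1)
n <- 20
p <- 40
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rbinom(n, size = 1, prob = 0.5)

# unpenalized logistic regression: the design matrix has rank at most n,
# so many coefficients are not estimable and come back NA
fit <- glm(y ~ x, family = binomial)
sum(is.na(coef(fit)))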
Here’s a representation of the design of the analysis procedure.
many potential failure modes
in particular, repeated use of the same data for different sub-analyses
If we did simultaneous selection and estimation, our schematic might look like this:
Imagine a logistic regression model with all 1317 predictors:
\[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j = 1}^{1317} \beta_j x_{ij} \]
Suppose we also impose a sparsity constraint: a condition that most of the coefficients must be exactly zero.
As long as the constraint is strong enough that the number of nonzero coefficients is fewer than \(n\) (i.e., \(\|\beta\|_0 < n\)), one can compute estimates.
The Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996) is the most widely used sparse regression estimator.
Mathematically, the LASSO is the \(L_1\)-constrained MLE, i.e., the solution to the optimization problem:
\[ \begin{aligned} \text{maximize}\quad &\mathcal{L}(\beta; x, y) \\ \text{subject to}\quad &\|\beta\|_1 < t \end{aligned} \]
The solution to the constrained problem
\[ \begin{aligned} \text{maximize}\quad &\mathcal{L}(\beta; x, y) \\ \text{subject to}\quad &\|\beta\|_1 < t \end{aligned} \]
can be expressed as the unconstrained optimization
\[ \hat{\beta} = \arg\min_\beta \left\{-\mathcal{L}(\beta; x, y) + \lambda\|\beta\|_1\right\} \]
where \(\lambda\) is a Lagrange multiplier.
I simulated 250 observations according to a logistic regression model with 13 predictors, all but 3 of which had zero coefficients:
\[ \beta = \left[ 0\; 3\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 2\; 2\right]^T \]
And computed LASSO estimates for a few values of \(\lambda\).
term | truth | lambda1 | lambda2 |
---|---|---|---|
(Intercept) | -0.5 | -0.171 | -0.206 |
V1 | 0 | 0 | 0 |
V2 | 3 | 1.49 | 1.96 |
V3 | 0 | 0 | 0 |
V4 | 0 | 0 | 0 |
V5 | 0 | 0 | 0 |
V6 | 0 | 0 | 0 |
V7 | 0 | 0 | 0 |
V8 | 0 | 0 | 0.084 |
V9 | 0 | 0 | 0 |
V10 | 0 | 0 | 0 |
V11 | 0 | 0 | -0.069 |
V12 | 2 | 0.892 | 1.29 |
V13 | 2 | 1.01 | 1.40 |
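The comparison above could be generated along these lines; this is a sketch rather than the exact code used in class, and the seed and the two penalty values are assumptions:

library(glmnet)

# simulate 250 observations from a logistic regression with 13 predictors,
# only 3 of which have nonzero coefficients
set.seed(110322)  # hypothetical seed
n <- 250
beta <- c(0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2)
x <- matrix(rnorm(n * 13), nrow = n, ncol = 13)
prob_y <- plogis(-0.5 + x %*% beta)  # intercept -0.5, matching the truth column
y <- rbinom(n, size = 1, prob = as.numeric(prob_y))

# fit the LASSO path, then inspect coefficients at two penalty values
fit <- glmnet(x, y, family = 'binomial', alpha = 1)
coef(fit, s = 0.10)  # 'lambda1': stronger penalty, sparser estimates
coef(fit, s = 0.05)  # 'lambda2': weaker penalty, a few extra small nonzero estimates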
In practice, \(\lambda\) controls the strength of the constraint and thus the sparsity:
Larger \(\lambda\) ➜ stronger constraint ➜ more sparse
Smaller \(\lambda\) ➜ weaker constraint ➜ less sparse
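One way to see this relationship is a coefficient path plot; a quick sketch, assuming the glmnet fit object from the simulation above:

# each curve traces one coefficient estimate as the penalty varies;
# at larger values of lambda, more curves sit exactly at zero
plot(fit, xvar = 'lambda', label = TRUE)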
How should the strength of the constraint be chosen? Typically prediction error is used as the selection metric, estimated by:
partitioning the data many times
averaging test set predictive accuracy across partitions
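This resampling scheme is what cv.glmnet carries out via k-fold cross-validation. A minimal sketch, reusing the simulated x and y from above; the fold count and error measure here are assumptions:

# estimate out-of-sample prediction error across a grid of penalty values
cv_out <- cv.glmnet(x, y,
                    family = 'binomial',
                    alpha = 1,
                    nfolds = 10,
                    type.measure = 'deviance')

# penalty value minimizing estimated prediction error
cv_out$lambda.min

# coefficient estimates at that value
coef(cv_out, s = 'lambda.min')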
At the chosen value of \(\lambda\), the nonzero estimates for the simulated data are:

term | estimate | truth |
---|---|---|
(Intercept) | -0.1712466 | -0.5 |
V2 | 1.4867846 | 3.0 |
V12 | 0.8922530 | 2.0 |
V13 | 1.0142979 | 2.0 |
Notice that the estimated magnitudes are shrunk well below the true values: the \(L_1\) penalty biases estimates toward zero. If we recompute the estimates without the constraint:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.2620012 | 0.1894367 | -1.383054 | 0.1666482 |
V2 | 2.5653900 | 0.3313415 | 7.742434 | 0.0000000 |
V12 | 1.7554670 | 0.2786778 | 6.299271 | 0.0000000 |
V13 | 1.8323539 | 0.2632716 | 6.959937 | 0.0000000 |
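The refit amounts to an ordinary logistic regression on just the selected predictors; a sketch, reusing the simulated x and y from earlier:

# keep only the predictors the LASSO selected (V2, V12, V13)
refit_df <- data.frame(x[, c(2, 12, 13)])
names(refit_df) <- c('V2', 'V12', 'V13')
refit_df$y <- y

# unpenalized maximum likelihood fit
fit_glm <- glm(y ~ V2 + V12 + V13, data = refit_df, family = binomial)

# estimates, standard errors, and p-values
broom::tidy(fit_glm)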
Now if I simulate some new observations and assess predictive accuracy on them:
.metric | .estimator | .estimate |
---|---|---|
sensitivity | binary | 0.8500000 |
specificity | binary | 0.9000000 |
roc_auc | binary | 0.9508333 |
Returning to the proteomics data: log-transform, center, and scale the protein levels, but don’t trim outliers.
library(tidyverse)
library(tidymodels)

# read in data; log-transform, then center and scale each protein
biomarker <- read_csv('data/biomarker-clean.csv') %>%
  select(-ados) %>%
  mutate(across(-group, ~ scale(log(.x))[, 1]),
         class = as.numeric(group == 'ASD'))

# partition into training and test sets
set.seed(101622)
partitions <- biomarker %>%
  initial_split(prop = 0.8)

# training predictors as a matrix and response as a vector
x_train <- training(partitions) %>%
  select(-group, -class) %>%
  as.matrix()

y_train <- training(partitions) %>%
  pull(class)
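From here, the selection step could be carried out along these lines; this is a sketch, and the fold count and the use of lambda.min are assumptions:

library(glmnet)

# choose the penalty strength by cross-validation on the training partition
set.seed(101622)
cv_fit <- cv.glmnet(x_train, y_train,
                    family = 'binomial',
                    alpha = 1,
                    nfolds = 10)

# proteins with nonzero coefficients at the selected penalty strength
lasso_coefs <- as.matrix(coef(cv_fit, s = 'lambda.min'))
selected <- rownames(lasso_coefs)[lasso_coefs[, 1] != 0]
selected <- setdiff(selected, '(Intercept)')  # drop the intercept
selected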
After dropping the penalty and recomputing estimates, the final model is:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0404465 | 0.2312305 | -0.1749184 | 0.8611438 |
MAPK2 | -0.9896079 | 0.2831955 | -3.4944340 | 0.0004751 |
IgD | -0.8264133 | 0.2471363 | -3.3439569 | 0.0008259 |
DERM | -0.9971069 | 0.2758803 | -3.6142740 | 0.0003012 |
Classification accuracy on the test partition:

.metric | .estimator | .estimate |
---|---|---|
sensitivity | binary | 0.8125000 |
specificity | binary | 0.7333333 |
roc_auc | binary | 0.8375000 |
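Those metrics can be computed with rsample and yardstick; a sketch, assuming the selected protein panel from the LASSO step above (the 0.5 classification threshold and object names are assumptions):

# refit on the training partition using only the selected proteins
train_df <- training(partitions) %>%
  select(class, any_of(selected))

fit_final <- glm(class ~ ., data = train_df, family = binomial)

# predicted probabilities and classes on the test partition
test_df <- testing(partitions)
pred_df <- test_df %>%
  mutate(pred_prob = predict(fit_final, newdata = test_df, type = 'response'),
         pred_class = factor(as.numeric(pred_prob > 0.5), levels = c(0, 1)),
         class = factor(class, levels = c(0, 1)))

# classification accuracy, treating class 1 (ASD) as the positive class
pred_df %>% sensitivity(truth = class, estimate = pred_class, event_level = 'second')
pred_df %>% specificity(truth = class, estimate = pred_class, event_level = 'second')
pred_df %>% roc_auc(truth = class, pred_prob, event_level = 'second')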
To recap, this module covered:
multiple hypothesis testing, FWER and FDR control
classification using logistic regression, random forests
variable selection using LASSO regularization
data partitioning and accuracy quantification
iterative computations in R, three ways
fitting a logistic regression model with glm()
use of yardstick and rsample for quantifying classification accuracy