Variable selection with the LASSO

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz



  • Office hours follow all class and section meetings in building 434 room 122.

    • Enter through courtyard, not on Storke side

Design assessment

10 Minute Activity

With your table, share your diagrams.

  • identify common elements/aspects

  • pick one or a blend that you like best and sketch it on the whiteboard

Design assessment

How would you score the proteomics analysis along the following dimensions?

  • reproducibility: how easy is it to reproduce the results exactly?

  • transparency: how widely distributed is potential variability in results across the elements of the data analysis?

  • replicability: how likely is it that if the study were conducted with new subjects, the same panel would be obtained?

An alternative analysis

Two steps

At a very high level, the analysis we’ve covered is a ‘two-step’ procedure:

  1. (selection step) select variables
  2. (estimation step) train a classifier using selected variables

A likely reason for this approach is that one cannot fit a regression model with 1317 predictors based on only 154 data points (\(p > n\)).

A schematic

Here’s a representation of the design of the analysis procedure.

G cluster1 statistical modeling cluster2 outputs data data fit logistic regression data->fit s1 multiple testing data->s1 s2 correlation analysis data->s2 s3 random forest data->s3 join data->join multiple partitioning selec ensemble selection selec->fit sel selected variables fit->sel fit->join s1->selec s2->selec s3->selec pred accuracy quantification join->pred
  • lots of failure modes

  • repeated use of the same data for different sub-analyses

What about one step?

If we did simultaneous selection and estimation, our schematic might look like this:

G cluster1 data partitioning cluster2 statistical modeling cluster3 outputs data data train training set data->train test test set data->test fit selection and estimation train->fit join test->join sel selected variables fit->sel fit->join pred accuracy quantification join->pred

Sparse estimation

Imagine a logistic regression model with all 1317 predictors:

\[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j = 1}^{1317} \beta_j x_{ij} \]

Suppose we also assume sparsity constraint: a condition that most of them must be zero.

As long as the constraint is strong enough so that the number of nonzero coefficients is \(n\) or fewer (i.e., \(\|\beta\|_0 < n\)), one can compute estimates.


The Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani (1996) ) is the most widely-used sparse regression estimator.

Mathematically, the LASSO is the \(L_1\)-constrained MLE, i.e., the solution to the optimization problem:

\[ \begin{aligned} \text{maximize}\quad &\mathcal{L}(\beta; x, y) \\ \text{subject to}\quad &\|\beta\|_1 < t \end{aligned} \]

Lagrangian form

The solution to the constrained problem

\[ \begin{aligned} \text{maximize}\quad &\mathcal{L}(\beta; x, y) \\ \text{subject to}\quad &\|\beta\|_1 < t \end{aligned} \]

can be expressed as the unconstrained optimization

\[ \hat{\beta} = \arg\min_\beta \left\{-\mathcal{L}(\beta; x, y) + \lambda\|\beta\|_1\right\} \]

where \(\lambda\) is a Lagrange multiplier.

Toy example

I simulated 250 observations according to a logistic regression model with 13 predictors, all but 3 of which had zero coefficients:

\[ \beta = \left[ 0\; 3\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 0\; 2\; 2\right]^T \]

And computed LASSO estimates for a few values of \(\lambda\).

# A tibble: 14 × 4
   term  truth lambda1 lambda2
   <chr> <dbl>   <dbl>   <dbl>
 1 Int    -0.5  -0.171  -0.206
 2 V1      0     0       0    
 3 V2      3     1.49    1.96 
 4 V3      0     0       0    
 5 V4      0     0       0    
 6 V5      0     0       0    
 7 V6      0     0       0    
 8 V7      0     0       0    
 9 V8      0     0       0.084
10 V9      0     0       0    
11 V10     0     0       0    
12 V11     0     0      -0.069
13 V12     2     0.892   1.29 
14 V13     2     1.01    1.40 

Constraint strength

In practice, \(\lambda\) controls the strength of the constraint and thus the sparsity:

  • Larger \(\lambda\) ➜ stronger constraint ➜ more sparse

  • Smaller \(\lambda\) ➜ weaker constraint ➜ less sparse

Selecting \(\lambda\)

How to choose the strength of constraint?

  1. Compute estimates for a “path” \(\lambda_1 > \lambda_2 > \cdots > \lambda_K\)
  2. Choose \(\lambda^*\) that minimizes an error metric.

Typically prediction error is used as the metric, and estimated by:

  • partitioning the data many times

  • averaging test set predictive accuracy across partitions

Regularization profile: estimates

Value of coefficient estimates as a function of regularization strength for the toy example. Each path corresponds to one model coefficient. Vertical line indicates optimal strength.

Regularization profile: error

Prediction error metric (‘deviance’, similar to RMSE) as a function of regularization strength. Two plausible choices shown – one more conservative, one less conservative.

\(\lambda\) selection

Estimate profile with optimal choice indicated by vertical line.

Coefficient estimates

term estimate truth
(Intercept) -0.1712466 -0.5
V2 1.4867846 3.0
V12 0.8922530 2.0
V13 1.0142979 2.0

Recomputing estimates

Notice that the estimate magnitudes are off by a considerable amount. If we recompute estimates without the constraint:

term estimate std.error statistic p.value
(Intercept) -0.2620012 0.1894367 -1.383054 0.1666482
V2 2.5653900 0.3313415 7.742434 0.0000000
V12 1.7554670 0.2786778 6.299271 0.0000000
V13 1.8323539 0.2632716 6.959937 0.0000000


Now if I simulate some new observations:

.metric .estimator .estimate
sensitivity binary 0.8500000
specificity binary 0.9000000
roc_auc binary 0.9508333

Alternative analysis

Log-transform, center and scale, but don’t trim outliers.


# read in data
biomarker <- read_csv('data/biomarker-clean.csv') %>%
  select(-ados) %>%
  mutate(across(-group, ~scale(log(.x))[,1]),
         class = as.numeric(group == 'ASD'))

# partition
partitions <- biomarker %>%
  initial_split(prop = 0.8)

x_train <- training(partitions) %>%
  select(-group, -class) %>%
y_train <- training(partitions) %>%

Estimated error

# reproducibility

# multiple partitioning for lambda selection
cv_out <- cv.glmnet(x_train, 
                    family = 'binomial', 
                    nfolds = 5, 
                    type.measure = 'deviance')

cvout_df <- tidy(cv_out) 

Coefficient paths

# LASSO estimates
fit <- glmnet(x_train, y_train, family = 'binomial')
fit_df <- tidy(fit)

After dropping the penalty and recomputing estimates, the final model is:

term estimate std.error statistic p.value
(Intercept) -0.0404465 0.2312305 -0.1749184 0.8611438
MAPK2 -0.9896079 0.2831955 -3.4944340 0.0004751
IgD -0.8264133 0.2471363 -3.3439569 0.0008259
DERM -0.9971069 0.2758803 -3.6142740 0.0003012
.metric .estimator .estimate
sensitivity binary 0.8125000
specificity binary 0.7333333
roc_auc binary 0.8375000

Some final thoughts

  1. Regardless of method, classification based on serum levels seems to achieve 70-80% accuracy out of sample.
  2. Not specific enough to be used as a diagnostic tool, but may be helpful in early detection.
  3. Published analysis seems a little over-complicated; computationally intensive methods are less transparent and more difficult to reproduce.

Concepts discussed

  • multiple hypothesis testing, FWER and FDR control

  • classification using logistic regression, random forests

  • variable selection using LASSO regularization

  • data partitioning and accuracy quantification

Transferrable skills

  • iterative computations in R, three ways

  • fitting a logistic regression model with glm()

  • use of yardstick and rsample for quantifying classification accuracy

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.