Activity: multinomial logistic regression

Setup

While we are getting started, on your table’s workstation open RStudio and execute the following command:

source('https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/scripts/package-installs.R')

Then open a new script, copy-paste the code chunk below, and execute once.

# packages
library(tidyverse)
library(tidymodels)
library(modelr)
library(Matrix)
library(sparsesvd)
library(glmnet)

# path to activity files on repo
url <- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/activities/data/'

# load a few functions for the activity
source(paste(url, 'projection-functions.R', sep = ''))

# read in data
claims <- paste(url, 'claims-multi-tfidf.csv', sep = '') %>%
  read_csv()

# preview
claims
# A tibble: 552 × 15,871
   mclass    .id   bclass  adams afternoon  agent android    app arkansas arrest
   <chr>     <chr> <chr>   <dbl>     <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
 1 unlawful  url1  relev… 0.0692    0.0300 0.0365  0.0450 0.0330   0.0390 0.0140
 2 irreleva… url3  irrel… 0         0      0       0      0        0      0     
 3 other     url4  relev… 0         0      0       0      0        0      0     
 4 fatality  url5  relev… 0         0      0       0      0        0      0     
 5 irreleva… url7  irrel… 0         0      0       0      0        0      0.0107
 6 fatality  url8  relev… 0         0      0       0      0        0      0     
 7 irreleva… url9  irrel… 0         0      0       0      0        0      0     
 8 irreleva… url10 irrel… 0         0      0       0      0        0      0     
 9 unlawful  url11 relev… 0         0      0       0      0        0      0.0716
10 unlawful  url12 relev… 0         0      0       0      0        0      0.0161
# … with 542 more rows, and 15,861 more variables: arrive <dbl>, assist <dbl>,
#   attic <dbl>, barricade <dbl>, block <dbl>, blytheville <dbl>, burn <dbl>,
#   captain <dbl>, catch <dbl>, check <dbl>, chemical <dbl>, copyright <dbl>,
#   county <dbl>, custody <dbl>, dehydration <dbl>, demand <dbl>,
#   department <dbl>, desktop <dbl>, device <dbl>, dispute <dbl>,
#   division <dbl>, drug <dbl>, enter <dbl>, exit <dbl>, family <dbl>,
#   federal <dbl>, fire <dbl>, force <dbl>, gas <dbl>, gray <dbl>, …

Activity 1 (10 min)

You’ll be given about ten minutes to do the following on your group’s workstation.

  1. Partition the data into training and test sets.
  2. Using the training data, find principal components that preserve at least 80% of the total variance and project the data onto those PCs.
  3. Fit a logistic regression model to the training data.

Step 1: partitioning

This should be familiar from last week’s lab. Use the code chunk below to partition the data. Do not change the RNG seed or split proportion!

# partition data
set.seed(102722)
partitions <- claims %>% initial_split(prop = 0.8)

# separate DTM from labels
test_dtm <- testing(partitions) %>%
  select(-.id, -bclass, -mclass)
test_labels <- testing(partitions) %>%
  select(.id, bclass, mclass)

# same, training set
train_dtm <- training(partitions) %>%
  select(-.id, -bclass, -mclass)
train_labels <- training(partitions) %>%
  select(.id, bclass, mclass)

Note that we have separated the document term matrix (DTM) from the labels for both partitions. When we project the data onto a subspace, we only want to project the DTM and not the labels.

Step 2: projection

Now find the number of principal components that capture at least 70% of variation and project the document term matrix (DTM) onto those components. Use the custom function projection_fn(.dtm, .prop) .

# find projections based on training data
proj_out <- projection_fn(.dtm = train_dtm, .prop = 0.7)
train_dtm_projected <- proj_out$data

# how many components were used?
proj_out$n_pc

Note: projections were found using the training data only. The test data will ultimately be projected onto the same components, as if it were new information we were feeding into a predictive model developed entirely using the training data.

Step 3: regression

Bind the binary labels to the projected document term matrix and fit a logistic regression model.

The code chunk below gives you the input data frame you need to use glm(). It’s up to you to specify the other arguments needed to fit the model.

train <- train_labels %>%
  transmute(bclass = factor(bclass)) %>%
  bind_cols(train_dtm_projected)

fit <- glm(..., data = train, ...)

You will most likely get a warning of some kind – that’s expected. Take note of what the warning says and stop here.

Important

Briefly discuss with your table: what do you think the warning means?

Activity 2 (10 min)

This part will guide you through the following steps.

  1. Fit a logistic regression model with an elastic net penalty to the training data.
  2. Quantify classification accuracy on the test data using sensitivity, specificity, and AUROC.

Step 1: fit a regularized logistic regression

glmnet implements the elastic net penalty when a parameter alpha is provided. In the function call, a predictor matrix and response vector are used to specify the model instead of a formula.

Use the code chunk below to fit the model for a path of regularization strengths, select a strength, and extract the fitted model corresponding to that strength. Do not adjust the RNG seed.

# store predictors and response as matrix and vector
x_train <- train %>% select(-bclass) %>% as.matrix()
y_train <- train_labels %>% pull(bclass)

# fit enet model
alpha_enet <- 0.3
fit_reg <- glmnet(x = x_train, 
                  y = y_train, 
                  family = 'binomial',
                  alpha = alpha_enet)

# choose a strength by cross-validation
set.seed(102722)
cvout <- cv.glmnet(x = x_train, 
                y = y_train, 
                family = 'binomial',
                alpha = alpha_enet)

# store optimal strength
lambda_opt <- cvout$lambda.min

# view results
cvout
Note

Comment. The elastic net parameter alpha controls the balance between ridge and LASSO penalties: alpha = 0 corresponds to ridge regression, alpha = 1 corresponds to LASSO, and all other values specify a mixture. When the parameter is closer to 1, the LASSO penalty is stronger relative to ridge; and vice-versa when it’s closer to 0.

Step 2: prediction

To compute predictions, we’ll need to project the test data onto the same directions used to transform the training data.

Once that’s done, we can simply feed the projected test data, the fitted model fit_reg, and the optimal strength lambda_opt to a predict() call.

# project test data onto PCs
test_dtm_projected <- reproject_fn(.dtm = test_dtm, proj_out)

# coerce to matrix
x_test <- as.matrix(test_dtm_projected)

# compute predicted probabilities
preds <- predict(fit_reg, 
                 s = lambda_opt, 
                 newx = x_test,
                 type = 'response')

Next bind the test labels to the predictions:

# store predictions in a data frame with true labels
pred_df <- test_labels %>%
  transmute(bclass = factor(bclass)) %>%
  bind_cols(pred = as.numeric(preds)) %>%
  mutate(bclass.pred = factor(pred > 0.5, 
                              labels = levels(bclass)))

# define classification metric panel 
panel <- metric_set(sensitivity, 
                    specificity, 
                    accuracy, 
                    roc_auc)

# compute test set accuracy
pred_df %>% panel(truth = bclass, 
                  estimate = bclass.pred, 
                  pred, 
                  event_level = 'second')
Important

Briefly discuss with your table:

  1. How satisfied are you with the predictive performance?
  2. Does the classifier do a better job picking out relevant pages or irrelevant pages?

Activity 3 (10 min)

Now we’ll fit a multinomial logistic regression model using the multiclass labels rather than the binary ones, still using regularization to prevent overfitting.

Step 1: multinomial regression

Use the code chunk below to do the fitting. Notice that it’s as simple as supplying the multiclass labels and changing the family = 'binomial' to family = 'multinomial' , but the number of non-intercept parameters is now

\[ \text{number of predictors} \times (\text{number of classes} - 1) \]

So in our case, the logistic regression model had \(p = 55\) , but when we fit a multinomial model to the data using labels with \(k = 5\) classes, we have \(p(k - 1) = 220\) parameters!

# get multiclass labels
y_train_multi <- train_labels %>% pull(mclass)

# fit enet model
alpha_enet <- 0.2
fit_reg_multi <- glmnet(x = x_train, 
                  y = y_train_multi, 
                  family = 'multinomial',
                  alpha = alpha_enet)

# choose a strength by cross-validation
set.seed(102722)
cvout_multi <- cv.glmnet(x = x_train, 
                   y = y_train_multi, 
                   family = 'multinomial',
                   alpha = alpha_enet)

# view results
cvout

Step 2: predictions

The predictions from this model are a set of proabilities, one per class:

preds_multi <- predict(fit_reg_multi, 
        s = cvout_multi$lambda.min, 
        newx = x_test,
        type = 'response')

as_tibble(preds_multi[, , 1]) 

If we choose the most probable class as the prediction and cross-tabulate with the actual label, we end up with the following table:

pred_class <- as_tibble(preds_multi[, , 1]) %>% 
  mutate(row = row_number()) %>%
  pivot_longer(-row, 
               names_to = 'label',
               values_to = 'probability') %>%
  group_by(row) %>%
  slice_max(probability, n = 1) %>%
  pull(label)

pred_tbl <- table(pull(test_labels, mclass), pred_class)

pred_tbl
Important

If time, take a moment to discuss:

  1. What do you think of the overall accuracy?
  2. Which classes are well-predicted and which are not?
  3. Do you prefer the logistic or multinomial regression and why?