```
# load packages
library(tidyverse)
library(tidymodels)
library(modelr)
library(rsample)
library(yardstick)
# read data
<- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/labs/lab4-logistic/data/biomarker_clean.csv'
url
<- c("DERM", "RELT", "IgD", "PTN", "FSTL1")
s_star <- read_csv(url) %>%
biomarker # subset to proteins of interest and group
select(group, any_of(s_star)) %>%
# convert group (chr) to binary (lgl)
mutate(class = (group == 'ASD')) %>%
select(-group)
```

# Measuring classification accuracy

In class you saw how to fit a logistic regression model using `glm()`

and some basic classification accuracy measures.

**Objectives**

In this lab you’ll carry out a more rigorous quantification of predictive accuracy by data partitioning. You’ll learn to use:

functions from the

`rsample`

package to partition data and retrieve partitionsfunctions from the

`modelr`

package to compute predictionsfunctions from the

`yardstick`

package for classification metrics

**Prerequisites**

Follow the action item below to get set up for the lab. You may need to install one or more packages if the `library()`

calls return an error.

**Setup**

- Create a new script for this lab in your labs directory.
- Copy-paste the code chunk below at the top of your script to load required packages and data.

## Data partitioning

In class we fit a logistic regression model to the data and evaluated classification accuracy on the very same data.

Because the parameter estimates optimize errors on the data used to fit the model, evaluating accuracy on the same data gives an overly optimistic assessment, because the errors have been made as small as possible.

** Data partitioning** consists in setting aside a random subset of observations that will be used

*only*to assess predictive accuracy and

*not*to fit any models. The held out data is treated as a ‘test’ set of observations to try to predict. This provides a more realistic assessment of predictive accuracy that is closer to what can be expected if genuinely new data is collected.

Partitioning is easy to do using `rsample::initial_split`

and specifying the proportion of data that should be retained for model fitting (the ‘training’ set).

Partitions are computed at random, so for reproducibility it is necessary to set the RNG seed at a fixed value.

**Partition the biomarker data into training and test sets**

Copy-paste the code chunk below into your script and execute once.

*Remarks:*

`set.seed()`

needs to be executed*together*with the lines that follow to ensure the same result is rendered every timethe output simply summarizes the partitions

```
# for reproducibility
set.seed(102022)
# partition data
<- biomarker %>%
partitions initial_split(prop = 0.8)
# examine
partitions
```

```
<Training/Testing/Total>
<123/31/154>
```

To retrieve the data partitions, one needs to use the helper functions `training()`

and `testing()`

:

```
# training set
training(partitions) %>% head(4)
```

```
# A tibble: 4 × 6
DERM RELT IgD PTN FSTL1 class
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 -0.409 0.108 1.82 -0.457 -1.31 TRUE
2 1.25 0.802 -0.461 0.910 0.973 FALSE
3 -0.697 -0.921 -1.26 0.0444 -1.46 TRUE
4 -0.591 1.34 2.31 0.269 0.802 TRUE
```

```
# testing set
testing(partitions) %>% head(4)
```

```
# A tibble: 4 × 6
DERM RELT IgD PTN FSTL1 class
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 0.140 -0.586 -1.53 -1.72 -1.96 TRUE
2 -0.276 0.410 0.454 -0.0375 -1.41 TRUE
3 0.927 -1.12 1.15 -0.290 0.190 TRUE
4 -0.00743 -0.319 -0.762 -1.27 -1.05 TRUE
```

## Model fitting

Fitting a logistic regression model is, as you saw in class, a one-line affair:

```
# fit glm
<- glm(class ~ .,
fit data = biomarker,
family = binomial(link = "logit"))
```

The `glm()`

function can fit many kinds of *generalized linear models*. Logistic regression is just one of this class of models in which:

the response follows a

*binomial*distribution (the Bernoulli distribution is the binomial with \(n = 1\))the log-odds or

*logit*transformation of the event/class probability \(p_i\) is linear in the predictors

The parameter estimates reported in tabular form are:

`tidy(fit)`

```
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0933 0.199 -0.468 0.640
2 DERM -0.603 0.280 -2.16 0.0311
3 RELT -0.438 0.286 -1.53 0.126
4 IgD -0.662 0.213 -3.11 0.00189
5 PTN -0.234 0.273 -0.857 0.392
6 FSTL1 -0.360 0.259 -1.39 0.165
```

**Fit the model and interpret a parameter**

- Copy-paste the code above in your script and execute to fit the logistic regression model.
- Confer with your neighbor and interpret one of the parameters of your choosing.

*By interpret, we mean: what is the estimated change in P(ASD) associated with a +1SD change in log protein level?*

## Predictions

The `modelr`

package makes it relatively easy to compute predictions for a wide range of model objects in R. Its pipe-friendly `add_predictions(.df, .mdl, type)`

function will add a column of predictions of type `type`

using model `.mdl`

to data frame `.df`

.

```
# compute predictions on the test set
testing(partitions) %>%
add_predictions(fit)
```

```
# A tibble: 31 × 7
DERM RELT IgD PTN FSTL1 class pred
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
1 0.140 -0.586 -1.53 -1.72 -1.96 TRUE 2.20
2 -0.276 0.410 0.454 -0.0375 -1.41 TRUE 0.108
3 0.927 -1.12 1.15 -0.290 0.190 TRUE -0.925
4 -0.00743 -0.319 -0.762 -1.27 -1.05 TRUE 1.23
5 -0.484 -0.612 -0.625 -0.990 -1.16 TRUE 1.53
6 0.111 -0.489 -1.15 0.0483 -1.12 TRUE 1.21
7 -0.972 -0.535 0.502 -0.594 -0.427 TRUE 0.687
8 -0.344 0.0337 0.998 -2.00 -0.309 TRUE 0.0179
9 -1.13 -0.385 -0.929 0.892 -0.597 TRUE 1.38
10 0.398 0.325 -1.17 -0.0114 -0.415 TRUE 0.450
# … with 21 more rows
```

Inspect the `pred`

column. Notice that the predictions are not classes or probabilities. The default type of predictions are log-odds. One could back-transform:

```
# manually transform to probabilities
testing(partitions) %>%
add_predictions(fit) %>%
mutate(probs = 1/(1 + exp(-pred))) %>%
select(class, pred, probs) %>%
head(5)
```

```
# A tibble: 5 × 3
class pred probs
<lgl> <dbl> <dbl>
1 TRUE 2.20 0.900
2 TRUE 0.108 0.527
3 TRUE -0.925 0.284
4 TRUE 1.23 0.774
5 TRUE 1.53 0.822
```

Or simply change the `type`

of predictions to `response`

in order to obtain predicted probabilities:

```
# predict on scale of response
testing(partitions) %>%
add_predictions(fit, type = 'response') %>%
select(class, pred) %>%
head(5)
```

```
# A tibble: 5 × 2
class pred
<lgl> <dbl>
1 TRUE 0.900
2 TRUE 0.527
3 TRUE 0.284
4 TRUE 0.774
5 TRUE 0.822
```

If we want to convert these predicted class probabilities into predicted classes, we can simply define a new variable based on whether the probabilities exceed 0.5 (or any other threshold):

```
# predict classes
testing(partitions) %>%
add_predictions(fit, type = 'response') %>%
mutate(pred.class = (pred > 0.5)) %>%
select(class, pred, pred.class) %>%
head(5)
```

```
# A tibble: 5 × 3
class pred pred.class
<lgl> <dbl> <lgl>
1 TRUE 0.900 TRUE
2 TRUE 0.527 TRUE
3 TRUE 0.284 FALSE
4 TRUE 0.774 TRUE
5 TRUE 0.822 TRUE
```

## Accuracy measures

The classification accuracy measures we discussed in class are based on tabulating observation counts when grouping by predicted and observed classes.

This tabulation can be done in a base-R way by piping a data frame of the predicted and observed classes into `table()`

:

```
# tabulate
testing(partitions) %>%
add_predictions(fit, type = 'response') %>%
mutate(pred.class = (pred > 0.5)) %>%
select(class, pred.class) %>%
table()
```

```
pred.class
class FALSE TRUE
FALSE 10 6
TRUE 3 12
```

However, the metrics discussed in class are somewhat painful to compute from the output above. Luckily, `yardstick`

makes that process easier: the package has specialized functions that compute each metric. One need only:

provide the predicted and true labels as factors

indicate which factor is the truth and which is the prediction

indicate which factor level is considered a ‘positive’

```
# store predictions as factors
<- testing(partitions) %>%
pred_df add_predictions(fit, type = 'response') %>%
mutate(pred.class = (pred > 0.5),
group = factor(class, labels = c('TD', 'ASD')),
pred.group = factor(pred.class, labels = c('TD', 'ASD')))
# check order of factor levels
%>% pull(group) %>% levels() pred_df
```

`[1] "TD" "ASD"`

```
# compute specificity
%>%
pred_df specificity(truth = group,
estimate = pred.group,
event_level = 'second')
```

```
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 specificity binary 0.625
```

The second level is ASD, which in this context is a positive. We knew since we supplied the labels in defining the factors, and the order of levels will match the order of labels. However, we can also check as above using `levels()`

. Hence, `event_level = 'second'`

.

Sensitivity, accuracy, and other metrics can be computed similarly:

```
# sensitivity
%>%
pred_df sensitivity(truth = group,
estimate = pred.group,
event_level = 'second')
```

```
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.8
```

You can check the package documentation for a complete list of available metrics.

**Compute the accuracy**

Find the appropriate function from the package documentation (link immediately above) and use it to compute the accuracy.

*Remark:* from the table, you know the result should be \(\frac{10 + 12}{10 + 12 + 6 + 3} \approx 0.7097\) .

The package also has a helper function that allows you to define a panel of metrics so that you can compute several simultaneously. If we want a panel of specificity and sensitivity, the following will do the trick:

```
# define panel (arguments must be yardstick metric function names)
<- metric_set(sensitivity, specificity)
panel_fn
# compute
%>%
pred_df panel_fn(truth = group,
estimate = pred.group,
event_level = 'second')
```

```
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.8
2 specificity binary 0.625
```

**Compute a panel of precision, recall, and F1 score**

- Find the appropriate yardstick functions and define a metric panel
- Compute on the test data

As a final comment, the table of classifications can be obtained in `yardstick`

using the `conf_mat()`

function. (The cross-classification table of predicted versus actual classes is called a *confusion matrix*.)

`%>% conf_mat(truth = group, estimate = pred.group) pred_df `

```
Truth
Prediction TD ASD
TD 10 3
ASD 6 12
```

## Checklist

- You partitioned the data into training and test sets
- You fit a logistic regression model using the training set
- You evaluated accuracy on the test set

## Extras

If there is extra time, or you’re interested in exploring a bit further on your own, read on.

### ROC curves

The `yardstick`

package also supplies functions for computing class probability metrics based on the estimated class probabilities rather than the estimated classes. ROC curves and AUROC are examples.

`roc_curve()`

will find all unique probability thresholds and, for each threshold, calculate sensitivity and specificity:

```
%>%
pred_df roc_curve(truth = group, estimate = pred)
```

```
# A tibble: 33 × 3
.threshold specificity sensitivity
<dbl> <dbl> <dbl>
1 -Inf 0 1
2 0.0664 0 1
3 0.117 0 0.938
4 0.158 0 0.875
5 0.186 0.0667 0.875
6 0.191 0.0667 0.812
7 0.197 0.0667 0.75
8 0.262 0.0667 0.688
9 0.284 0.0667 0.625
10 0.299 0.133 0.625
# … with 23 more rows
```

This can be used to plot the ROC curve:

```
%>%
pred_df roc_curve(truth = group,
estimate = pred,
event_level = 'second') %>%
ggplot(aes(y = sensitivity, x = 1 - specificity)) +
geom_path() +
geom_abline(slope = 1, intercept = 0)
```

In this case the ROC curve is so choppy because the test set only includes 31 observations. In general, for a collection of \(n\) observations there are at most \(n + 1\) unique thresholds and usually considerably fewer.

The area under the ROC curve is also easy to compute:

```
%>% roc_auc(truth = group,
pred_df estimate = pred,
event_level = 'second')
```

```
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.712
```

### Combined metric types

If you wish to compute *both* classification metrics based on a class prediction *and* class probability metrics based on a probability prediction in a metric panel, the classification should be provided as the argument to `estimate = ...`

and class probability column can be provided as an unnamed argument following the `estimate`

argument.

For example:

```
<- metric_set(roc_auc, accuracy)
panel
%>% panel(truth = group,
pred_df estimate = pred.group,
pred,event_level = 'second')
```

```
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.710
2 roc_auc binary 0.712
```

### Exploring partitions

Review this section if you want a deeper understanding of data partitioning.

The rationale for partitioning is that held out data will give a more honest assessment. Conversely, evaluating accuracy on data used to fit a model will provide an overly optimistic assessment.

We can experimentally confirm this intuition by:

- repeatedly partitioning the data at random
- evaluating accuracy on both partitions
- averaging across partitions

This procedure will reveal that *on average* the accuracy is better on training data. Don’t worry too much about the computations; focus on the output and the concepts.

We’ll use the tidyverse iteration strategy from lab 3. First we’ll need some helper functions that are basically wrappers around each step we went through in this lab:

- fitting a model
- adding predictions
- evaluating metrics

```
# define some helper functions
<- function(.df){
fit_fn glm(class ~ ., family = 'binomial', data = .df)
}
<- function(.df, .mdl){
pred_fn %>% add_predictions(.mdl, type = 'response')
.df
}
<- metric_set(sensitivity, specificity, accuracy, roc_auc)
panel
<- function(.df){
eval_fn %>%
.df mutate(group = factor(class, labels = c('TD', 'ASD')),
pred.group = factor(pred > 0.5, labels = c('TD', 'ASD'))) %>%
panel(truth = group,
estimate = pred.group,
pred,event_level = 'second')
}
```

Now let’s create 400 random partitions of the data and perform the steps in this lab for every partition. In addition, we’ll compute predictions and evaluate accuracy on the training set, which we did not do above and is generally not done.

```
set.seed(101922)
<- 400
n_splits <- tibble(partition = 1:n_splits,
out split = map(partition, ~ initial_split(biomarker, prop = 0.8)),
train = map(split, training),
test = map(split, testing),
fit = map(train, fit_fn),
pred_test = map2(test, fit, pred_fn),
pred_train = map2(train, fit, pred_fn),
eval_test = map(pred_test, eval_fn),
eval_train = map(pred_train, eval_fn))
%>% head(4) out
```

```
# A tibble: 4 × 9
partition split train test fit pred_test pred_train
<int> <list> <list> <list> <list> <list> <list>
1 1 <split [123/31]> <tibble> <tibble> <glm> <tibble> <tibble>
2 2 <split [123/31]> <tibble> <tibble> <glm> <tibble> <tibble>
3 3 <split [123/31]> <tibble> <tibble> <glm> <tibble> <tibble>
4 4 <split [123/31]> <tibble> <tibble> <glm> <tibble> <tibble>
# … with 2 more variables: eval_test <list>, eval_train <list>
```

We can extract the accuracy of predictions on each of the training and test sets as follows:

```
<- out %>%
test_accuracy select(partition, contains('eval')) %>%
unnest(eval_test) %>%
select(partition, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
<- out %>%
train_accuracy select(partition, contains('eval')) %>%
unnest(eval_train) %>%
select(partition, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
%>% head(4) test_accuracy
```

```
# A tibble: 4 × 5
partition sensitivity specificity accuracy roc_auc
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.6 0.562 0.581 0.675
2 2 0.778 0.538 0.677 0.645
3 3 0.684 0.917 0.774 0.846
4 4 0.684 0.833 0.742 0.842
```

`%>% head(4) train_accuracy `

```
# A tibble: 4 × 5
partition sensitivity specificity accuracy roc_auc
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.803 0.806 0.805 0.858
2 2 0.741 0.831 0.789 0.858
3 3 0.737 0.742 0.740 0.825
4 4 0.719 0.773 0.748 0.825
```

Lastly, we can average the metrics over all partitions and also check the variability across partitions. We’ll start with the training set:

```
<- train_accuracy %>%
train_summaries rename(roc.auc = roc_auc) %>%
select(-partition) %>%
summarize_all(.funs = list(mean = mean, sd = sd)) %>%
gather() %>%
separate(key, into = c('metric', 'stat'), sep = '_') %>%
spread(stat, value)
```

Now compute the average and variability on the test set:

```
<- test_accuracy %>%
test_summaries rename(roc.auc = roc_auc) %>%
select(-partition) %>%
summarize_all(.funs = list(mean = mean, sd = sd)) %>%
gather() %>%
separate(key, into = c('metric', 'stat'), sep = '_') %>%
spread(stat, value)
```

Now let’s put them side-by-side:

```
left_join(train_summaries,
test_summaries,by = 'metric',
suffix = c('.train', '.test')) %>%
select(metric, contains('mean'), contains('sd')) %>%
::kable() knitr
```

metric | mean.train | mean.test | sd.train | sd.test |
---|---|---|---|---|

accuracy | 0.7573374 | 0.7225806 | 0.0207045 | 0.0702816 |

roc.auc | 0.8348238 | 0.7994956 | 0.0158188 | 0.0706822 |

sensitivity | 0.7565383 | 0.7093282 | 0.0308425 | 0.1073645 |

specificity | 0.7569019 | 0.7416470 | 0.0282898 | 0.1148716 |

Notice that:

- The apparent accuracy according to all metrics is higher on average on the training data across partitionings
- The accuracy metrics on the test data are more variable across partitionings

This behavior is the justification for data partitioning: evaluating predictions on the same data that was used to fit the model *overestimates* the accuracy compared with data that was not used in fitting.