6 - Postprocessing

Getting More Out of Feature Engineering and Tuning for Machine Learning

Startup!

library(tidymodels)
library(desirability2)
library(probably)
library(mirai)

# check torch:
if (torch::torch_is_installed()) {
  library(torch)
}

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
daemons(parallel::detectCores())

More startup!

# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |> 
  paste0("workshops/main/slides/class_data.RData") |> 
  url() |> 
  load()

set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test  <- testing(sim_split)

set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)

Our neural network model

rec <- 
  recipe(class ~ ., data = sim_train) |> 
  step_normalize(all_numeric_predictors())

nnet_spec <- 
  mlp(hidden_units = tune(), penalty = tune(), learn_rate = tune(), 
      epochs = 100, activation = tune()) |> 
  # Remove the class_weights argument
  set_engine("brulee", stop_iter = 10) |> 
  set_mode("classification")
  
nnet_wflow <- workflow(rec, nnet_spec)

nnet_param <- 
  nnet_wflow |> 
  extract_parameter_set_dials()   

What is postprocessing?

Adjusting model predictions

How can we modify our predictions? Some examples:

  • Fixing calibration issues*.
  • Limiting the range of predictions.
  • Using alternative cutoffs for binary data.
  • Declining to predict.

* Requires further estimation (and data).


Let’s first consider the easiest case: alternative cutoffs.

Alternative thresholds

Instead of up-weighting the samples in the minority class (via class_weights), we can try to fit the best model and then define what it means to be an “event.”


Instead of using a 50% threshold, we might lower the level of evidence needed to call a prediction an event.


How do we tune the threshold?

Tailors

The tailor package is similar to recipes, but it specifies how to adjust predictions rather than predictors.

A simple example:

thrsh_tlr <-
  tailor() |>
  adjust_probability_threshold(threshold = 1 / 3)

thrsh_tlr
  • Like a recipe, this initial call doesn’t do anything but declare intent.

  • Unlike a recipe, it does not need the data (i.e., predictions) at this point.

    • Relevant prediction columns are selected when fit() is used (next slide).

Manual use of a tailor

There is a fit() method that requires data and the names of the prediction columns:

three_rows <- 
  tribble(
    ~class,     ~.pred_class, ~.pred_event, ~.pred_nonevent,
    "event",    "event",      0.6,          0.4,
    "event",    "nonevent",   0.4,          0.6,
    "nonevent", "nonevent",   0.1,          0.9
  ) |> 
  mutate(across(where(is.character), factor))

thrsh_fit <-
  thrsh_tlr |>
  fit(
    three_rows,
    outcome = class,
    estimate = .pred_class,
    .pred_event:.pred_nonevent  # No argument name; column order matches the factor levels
  )

Manual use of a tailor

predict() applies the adjustments:

thrsh_fit

predict(thrsh_fit, three_rows)
#> # A tibble: 3 × 4
#>   class    .pred_class .pred_event .pred_nonevent
#>   <fct>    <fct>             <dbl>          <dbl>
#> 1 event    event               0.6            0.4
#> 2 event    event               0.4            0.6
#> 3 nonevent nonevent            0.1            0.9


Tailors within workflows

In practice, we would add the tailor to a workflow to make it easier to use:


nnet_wflow <- workflow(rec, nnet_spec, thrsh_tlr)


  • We don’t have to set the names of the outcome or prediction columns (yet).
  • fit() and predict() happen automatically.
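
As a rough sketch of what this buys us: fit() trains the recipe, the model, and the tailor together, and predict() applies the threshold adjustment to new data. The model below is a stand-in (logistic_reg() is not part of the original workflow) because nnet_spec still contains tune() placeholders and cannot be fit directly.

# Sketch only: a simple model so that fit() can run without tuning
demo_wflow <- workflow(rec, logistic_reg(), thrsh_tlr)
demo_fit <- fit(demo_wflow, data = sim_train)

# The class predictions should already reflect the 1/3 threshold
predict(demo_fit, head(sim_test))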

Current adjustments
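
A sketch of the adjustments that the tailor package currently provides (the list and argument values below follow the package documentation and may not be exhaustive), chained together purely for illustration:

# Classification adjustments; probability calibration is listed first
# (see the ordering notes below)
tailor() |>
  adjust_probability_calibration(method = "logistic") |>  # re-estimate class probabilities
  adjust_probability_threshold(threshold = 1 / 3)         # alternative event cutoff

# Other adjustments include adjust_equivocal_zone() (decline to predict near
# the cutoff), adjust_numeric_calibration() and adjust_numeric_range() for
# regression, and adjust_predictions_custom() for ad hoc dplyr-style changes.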

Some notes

  • Adjustment order matters; tailor will error early if the ordering rules are violated.

  • Adjustments that change class probabilities also affect hard class predictions.

  • Adjustments happen before performance estimation.

    • Undoing something like a log transformation is a bad idea here.
  • We have more calibration methods in mind.

Your turn

  • Discuss with those around you what the “ordering rules” could be.


03:00

More notes

  • When estimation is required, the data considerations become more complex.

  • Most arguments can be tuned.

  • For grid search, we use a conditional execution algorithm that avoids redundant retraining of the preprocessor or model.

Back to our neural network

Tuning the probability threshold

thrsh_tlr <-
  tailor() |>
  adjust_probability_threshold(threshold = tune())

nnet_thrsh_wflow <- workflow(rec, nnet_spec, thrsh_tlr)
  
nnet_thrsh_param <- 
  nnet_thrsh_wflow |> 
  extract_parameter_set_dials() |> 
  update(threshold = threshold(c(0.001, 0.5)))

Tuning the probability threshold

Nearly the same code as before:


ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)
cls_mtr <- metric_set(brier_class, roc_auc, sensitivity, specificity)

set.seed(12)
nnet_thrsh_res <-
  nnet_thrsh_wflow |>
  tune_grid(
    resamples = sim_rs,
    grid = 25,
    param_info = nnet_thrsh_param, 
    control = ctrl,
    metrics = cls_mtr
  )

Grid results

autoplot(nnet_thrsh_res)

Grid results

  • tanh activation is doing much better.
  • threshold should not (and does not) affect the Brier or ROC metrics.
  • We can achieve low Brier scores.
  • We could run another grid with threshold values below 2% for a better estimate (sketched below).
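
A sketch of that follow-up grid; the range below is an illustrative choice:

nnet_low_thrsh_param <-
  nnet_thrsh_wflow |>
  extract_parameter_set_dials() |>
  update(threshold = threshold(c(0.001, 0.02)))

# Then rerun tune_grid() with param_info = nnet_low_thrsh_param, keeping the
# other arguments the same as before.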

Multimetric optimization

nnet_thrsh_res |>
  show_best_desirability(
    maximize(sensitivity),
    minimize(brier_class),
    constrain(specificity, low = 0.8, high = 1.0)
  ) |>
  relocate(threshold, sensitivity, specificity, brier_class, .d_overall)
#> # A tibble: 5 × 14
#>   threshold sensitivity specificity brier_class .d_overall hidden_units  penalty
#>       <dbl>       <dbl>       <dbl>       <dbl>      <dbl>        <int>    <dbl>
#> 1    0.0218       0.935       0.870      0.0449      0.948           36 2.61e- 5
#> 2    0.250        0.828       0.951      0.0418      0.902           28 2.61e-10
#> 3    0.209        0.833       0.936      0.0436      0.896            8 1   e-10
#> 4    0.188        0.833       0.934      0.0445      0.891           18 5.62e- 2
#> 5    0.0634       0.899       0.890      0.0527      0.884           40 2.15e- 2
#> # ℹ 7 more variables: activation <chr>, learn_rate <dbl>, .config <chr>,
#> #   roc_auc <dbl>, .d_max_sensitivity <dbl>, .d_min_brier_class <dbl>,
#> #   .d_box_specificity <dbl>

Calibration

more_sens <-
  nnet_thrsh_res |>
  select_best_desirability(
    maximize(sensitivity),
    minimize(brier_class),
    constrain(specificity, low = 0.8, high = 1.0)
  )

nnet_thrsh_res |>
  collect_predictions(
    parameters = more_sens
  ) |>
  cal_plot_windowed(
    truth = class,
    estimate = .pred_event,
    window_size = 0.2,
    step_size = 0.025,
  )

Thoughts about these results

The calibration issue in the previous plot shows that some very likely non-events will have underestimated probabilities.

  • That may not matter if we are very focused on events.
  • Thresholding does not affect calibration.
  • We might be able to:
    • Further tune the neural network to solve the issue and/or
    • Add a calibration postprocessor

Thoughts about the approach

Let’s say that we pick a threshold of 2%. Our explanation to the user/stakeholder would be

“As long as the model is at least 2% sure it is an event, we will call it an event”.

It may be challenging to convince someone that this is the best option.


That said, this is probably a better approach than cost-sensitive learning.

Fitting a postprocessor

Data to train the adjustments

If an adjustment requires data, where do we get it from?

  • Fitting a calibration model to re-predictions of the training set would be a bad idea; those predictions are optimistically biased.

  • Also, we don’t want to touch the validation or test sets.


We need another data set.

Data sources

Two possibilities:

  1. Shave some data off the training set to create a calibration set.
    • During resampling, we can do the same to the analysis set.
    • The “shaving” process emulates the original sampling method.
    • There are fewer data points for training the preprocessor and primary model.
  2. Use a static calibration set outside our training/validation/testing splits.

Currently, we have implemented the first method.

Example: 3-fold CV

Example: 3-fold CV internal split
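
A rough code emulation of the internal split idea; the 90/10 “shave” proportion is an illustrative assumption, not what tune uses internally.

set.seed(101)
three_folds <- vfold_cv(sim_train, v = 3, strata = class)

fold_split   <- three_folds$splits[[1]]
analysis_set <- analysis(fold_split)    # normally trains the preprocessor + model
assess_set   <- assessment(fold_split)  # still used only for performance estimation

# Shave part of the analysis set off to train the data-dependent adjustment:
shave      <- initial_split(analysis_set, prop = 0.9, strata = class)
model_data <- training(shave)           # preprocessor + model training
cal_data   <- testing(shave)            # fits the postprocessor (e.g., a calibrator)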

Breakdown for the class imbalance data


Data set      Strategy         event   no_event
Original      All                225       1775
Training      No calibration     168       1331
Analysis      No calibration     151       1197
Training      Calibration        126        998
Analysis      Calibration        135       1077
Calibration   Calibration         16        120

Calibration

Calibration models, in essence, try to predict the true class using the model predictions. Symbolically:

  • For regression models: outcome ~ .pred
  • For binary classifiers: class ~ .pred_class_1
  • For multiclass: class ~ .pred_class_1 + .pred_class_2 + ...

Each calibration method works slightly differently.

For example, in regression, a (generalized) linear model is fit, and the residuals are added to new predictions.
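
As a sketch, here is what fitting and applying one of these models could look like with the probably package, using the resampled predictions collected earlier. The logistic method is one choice among several, and applying it back to the same predictions is only for illustration:

nnet_preds <- collect_predictions(nnet_thrsh_res, parameters = more_sens)

# In effect, fit class ~ .pred_event with a logistic calibration model
cal_mod <-
  nnet_preds |>
  cal_estimate_logistic(truth = class, estimate = c(.pred_event, .pred_nonevent))

# Adjust the probabilities and re-check the calibration plot
nnet_preds |>
  cal_apply(cal_mod) |>
  cal_plot_windowed(truth = class, estimate = .pred_event, window_size = 0.2)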

Calibration expectations

Keep expectations low. For these methods to work:

  • The systematic issue will need to be large, or at least not subtle.
  • A large calibration set is needed for them to work effectively.

ALM4TD has an illustrative example and more details.

In many cases, trying a different model or tuning parameters would be better.


You can tune the calibration method; one of the candidate methods is no calibration at all.

Your turn

  • Based on previous results, choose and fix a specific activation type (i.e., no tune()).
  • Add a calibrator to your tailor with a method = tune() value.
  • Run another grid search (one possible setup is sketched below).

Does it help with this data set?
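
A sketch of one possible setup; fixing the activation to "tanh" and placing the calibrator before the threshold adjustment are illustrative choices:

cal_tlr <-
  tailor() |>
  adjust_probability_calibration(method = tune()) |>
  adjust_probability_threshold(threshold = tune())

nnet_cal_spec <-
  mlp(hidden_units = tune(), penalty = tune(), learn_rate = tune(),
      epochs = 100, activation = "tanh") |>   # fixed based on the earlier results
  set_engine("brulee", stop_iter = 10) |>
  set_mode("classification")

nnet_cal_wflow <- workflow(rec, nnet_cal_spec, cal_tlr)

set.seed(12)
nnet_cal_res <-
  nnet_cal_wflow |>
  tune_grid(
    resamples = sim_rs,
    grid = 25,
    control = ctrl,   # the threshold range could also be narrowed via param_info
    metrics = cls_mtr
  )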


10:00