Getting More Out of Feature Engineering and Tuning for Machine Learning
```r
library(tidymodels)

# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |>
  paste0("workshops/main/slides/class_data.RData") |>
  url() |>
  load()

set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test  <- testing(sim_split)

set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)

rec <-
  recipe(class ~ ., data = sim_train) |>
  step_normalize(all_numeric_predictors())

nnet_spec <-
  mlp(hidden_units = tune(), penalty = tune(), learn_rate = tune(),
      epochs = 100, activation = tune()) |>
  # Remove the class_weights argument
  set_engine("brulee", stop_iter = 10) |>
  set_mode("classification")

nnet_wflow <- workflow(rec, nnet_spec)

nnet_param <-
  nnet_wflow |>
  extract_parameter_set_dials()
```
How can we modify our predictions? Some adjustments can be applied directly, while others require further estimation (and additional data).
Let’s first consider the easiest case: alternative cutoffs.
Instead of up-weighting the samples in the minority class (via `class_weights`), we can try to fit the best model and then define what it means to be an “event.”
Instead of using a 50% threshold, we might lower the level of evidence needed to call a prediction an event.
How do we tune the threshold?
The tailor package is similar to recipes but specifies how to adjust predictions.
A simple example:
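A minimal sketch of what that example might look like (the object name `thrsh_tlr` is reused later; the specific threshold value here is an assumption):

```r
library(tailor)

# Declare a postprocessing adjustment: use a 10% probability cutoff for "event"
thrsh_tlr <-
  tailor() |>
  adjust_probability_threshold(threshold = 0.1)
```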
Like a recipe, this initial call doesn’t do anything but declare intent.
Unlike a recipe, it does not need the data (i.e., predictions) at this point.
Adjustments are not applied until `fit()` is used (next slide). There is a `fit()` method that requires data and the names of the prediction columns:
```r
three_rows <-
  tribble(
    ~class,     ~.pred_class, ~.pred_event, ~.pred_nonevent,
    "event",    "event",      0.6,          0.4,
    "event",    "nonevent",   0.4,          0.6,
    "nonevent", "nonevent",   0.1,          0.9
  ) |>
  mutate(across(where(is.character), factor))

thrsh_fit <-
  thrsh_tlr |>
  fit(
    three_rows,
    outcome = class,
    estimate = .pred_class,
    .pred_event:.pred_nonevent  # No argument name and order matches factor levels
  )
```
`predict()` applies the adjustments:
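A minimal sketch of applying it to the same three rows (the output is not reproduced here):

```r
# The result should contain .pred_class recomputed with the new threshold
predict(thrsh_fit, three_rows)
```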
In practice, we would add the tailor to a workflow to make it easier to use; `fit()` and `predict()` then happen automatically.
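A sketch of what that might look like, assuming the `add_tailor()` function from the workflows package (the workflow name is hypothetical):

```r
# Attach the tailor as a postprocessing stage of the existing workflow
nnet_tlr_wflow <-
  nnet_wflow |>
  add_tailor(thrsh_tlr)
```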
The available adjustments are:

* `adjust_equivocal_zone()`: decline to predict.
* `adjust_numeric_calibration()`: try to readjust numeric predictions to be consistent with the observed outcome values.
* `adjust_numeric_range()`: restrict the range of predictions.
* `adjust_predictions_custom()`: similar to `dplyr::mutate()`.
* `adjust_probability_calibration()`: try to readjust class probability predictions to be consistent with the observed classes.
* `adjust_probability_threshold()`: a custom rule for making hard class predictions from probabilities.

Adjustment order matters; tailor will error early if the ordering rules are violated.
Adjustments that change class probabilities also affect hard class predictions.
Adjustments happen before performance estimation.
We have more calibration methods in mind.
When estimation is required, the data considerations become more complex.
Most arguments can be tuned.
For grid search, we use a conditional execution algorithm that avoids redundant retraining of the preprocessor or model.
Nearly the same code as before:
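A sketch of that code, with the threshold marked for tuning (the seed, grid size, and metric set below are assumptions rather than the original values):

```r
# Redefine the tailor so that the threshold is a tuning parameter
thrsh_tlr <- tailor() |> adjust_probability_threshold(threshold = tune())

nnet_thrsh_wflow <-
  workflow(rec, nnet_spec) |>
  add_tailor(thrsh_tlr)

# Metrics consistent with the results shown below (assumed)
cls_mtr <- metric_set(sensitivity, specificity, brier_class, roc_auc)

set.seed(100)
nnet_thrsh_res <-
  nnet_thrsh_wflow |>
  tune_grid(resamples = sim_rs, grid = 25, metrics = cls_mtr)
```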
The `tanh` activation is doing much better.

The `threshold` should not (and does not) affect the Brier or ROC metrics.

```r
nnet_thrsh_res |>
  show_best_desirability(
    maximize(sensitivity),
    minimize(brier_class),
    constrain(specificity, low = 0.8, high = 1.0)
  ) |>
  relocate(threshold, sensitivity, specificity, brier_class, .d_overall)
#> # A tibble: 5 × 14
#>   threshold sensitivity specificity brier_class .d_overall hidden_units  penalty
#>       <dbl>       <dbl>       <dbl>       <dbl>      <dbl>        <int>    <dbl>
#> 1    0.0218       0.935       0.870      0.0449      0.948           36 2.61e- 5
#> 2    0.250        0.828       0.951      0.0418      0.902           28 2.61e-10
#> 3    0.209        0.833       0.936      0.0436      0.896            8 1   e-10
#> 4    0.188        0.833       0.934      0.0445      0.891           18 5.62e- 2
#> 5    0.0634       0.899       0.890      0.0527      0.884           40 2.15e- 2
#> # ℹ 7 more variables: activation <chr>, learn_rate <dbl>, .config <chr>,
#> #   roc_auc <dbl>, .d_max_sensitivity <dbl>, .d_min_brier_class <dbl>,
#> #   .d_box_specificity <dbl>
```
```r
more_sens <-
  nnet_thrsh_res |>
  select_best_desirability(
    maximize(sensitivity),
    minimize(brier_class),
    constrain(specificity, low = 0.8, high = 1.0)
  )

nnet_thrsh_res |>
  collect_predictions(parameters = more_sens) |>
  cal_plot_windowed(
    truth = class,
    estimate = .pred_event,
    window_size = 0.2,
    step_size = 0.025
  )
```
The calibration issue in the previous plot shows that some very likely non-events will have underestimated probabilities.
Let’s say that we pick a threshold of 2%. Our explanation to the user/stakeholder would be
“As long as the model is at least 2% sure it is an event, we will call it an event”.
It may be challenging to convince someone that this is the best option.
That said, this is probably a better approach than cost-sensitive learning.
If an adjustment requires data, where do we get it from?
Fitting a calibration model to the training set re-predictions would be bad.
Also, we don’t want to touch the validation or test sets.
We need another data set.
There are two possibilities; currently, we have implemented the first one. The resulting data usage is summarized below:
| Data        | Strategy       | event | no_event |
|-------------|----------------|------:|---------:|
| Original    | All            |   225 |     1775 |
| Training    | No Calibration |   168 |     1331 |
| Analysis    | No Calibration |   151 |     1197 |
| Training    | Calibration    |   126 |      998 |
| Analysis    | Calibration    |   135 |     1077 |
| Calibration | Calibration    |    16 |      120 |
Calibration models, in essence, try to predict the true outcome using the model's own predictions. Symbolically:

```r
outcome ~ .pred                                 # regression
class   ~ .pred_class_1                         # two classes
class   ~ .pred_class_1 + .pred_class_2 + ...   # three or more classes
```
Each calibration method works slightly differently.
For example, in regression, a (generalized) linear model is fit, and the residuals are added to new predictions.
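As a conceptual illustration of that regression case (this mirrors the description above, not the actual tailor/probably implementation; all object names here are hypothetical):

```r
# Fit a GLM to the residuals of held-out predictions, then shift new
# predictions by the predicted residual.
cal_fit <- glm(outcome - .pred ~ .pred, data = cal_preds)

new_preds$.pred_adjusted <-
  new_preds$.pred + predict(cal_fit, newdata = new_preds)
```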
Keep expectations low; these methods only work under certain conditions. ALM4TD has an illustrative example and more details. In many cases, trying a different model or different tuning parameter values would be better.
You can tune the calibration method; one of the candidate methods is no calibration at all. To do so, pass a `method = tune()` value. Does it help with this data set?
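A sketch of tuning the calibration method along with the threshold (the object names, seed, and grid size are assumptions):

```r
# The candidate methods can include "none", i.e., no calibration at all.
# Calibration must come before the threshold adjustment.
cal_tlr <-
  tailor() |>
  adjust_probability_calibration(method = tune()) |>
  adjust_probability_threshold(threshold = tune())

nnet_cal_wflow <-
  workflow(rec, nnet_spec) |>
  add_tailor(cal_tlr)

set.seed(101)
nnet_cal_res <-
  nnet_cal_wflow |>
  tune_grid(resamples = sim_rs, grid = 25, metrics = cls_mtr)
```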