Getting More Out of Feature Engineering and Tuning for Machine Learning
# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |>
paste0("workshops/main/slides/class_data.RData") |>
url() |>
load()
set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test <- testing(sim_split)
set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)
From the seventh advanced slide deck:
nnet_spec <-
mlp(hidden_units = tune(), penalty = tune(), learn_rate = tune(),
epochs = 100, activation = tune()
) |>
set_engine("brulee", stop_iter = 10) |>
set_mode("classification")
rec <-
recipe(class ~ ., data = sim_train) |>
step_normalize(all_numeric_predictors())
thrsh_tlr <-
tailor() |>
adjust_probability_threshold(threshold = tune())
nnet_wflow <- workflow(rec, nnet_spec, thrsh_tlr)
nnet_param <-
nnet_wflow |>
extract_parameter_set_dials() |>
update(threshold = threshold(c(0.0001, 0.1)))
cls_mtr <- metric_set(brier_class, roc_auc, sensitivity, specificity)
Instead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be.
Suppose that we are only tuning the learning rate in our neural network.
We could do something like:
and use this to predict and rank new learning rate candidates.
A linear model probably isn’t the best choice though (more in a minute).
To illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between error and learning rate.
Now suppose that we used a grid of three points in the parameter range for learning rate…
We can make a “meta-model” with a small set of historical performance results.
Gaussian Processes (GP) models are a good choice to model performance.
\[\operatorname{cov}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{1}{2}|\boldsymbol{x}_i - \boldsymbol{x}_j|^2\right) + \sigma^2_{ij}\]
The GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. Brier, ROC AUC, RMSE, etc.)
The variance is mostly driven by spatial variability (the previous equation).
The predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data.
Your GP makes predictions on two new candidate tuning parameters.
We want to minimize error.
Which should we choose?
03:00
This isn’t a very good fit but we can still use it.
How can we use the outputs to choose the next point to measure?
Acquisition functions take the predicted mean and variance and use them to balance:
Exploration focuses on the variance, exploitation is about the mean.
We’ll use an acquisition function to select a new candidate.
The most popular method appears to be expected improvement (EI) above the current best results.
We would probably pick the point with the largest EI as the next point.
(There are other functions beyond EI.)
Once we pick the candidate point, we measure performance for it (e.g. resampling).
Another GP is fit, EI is recomputed, and so on.
We stop when we have completed the allowed number of iterations or if we don’t see any improvement after a pre-set number of attempts.
We’ll use a function called tune_bayes()
that has very similar syntax to tune_grid()
.
It has an additional initial
argument for the initial set of performance estimates and parameter combinations for the GP model.
initial
can be the results of another tune_*()
function or an integer (in which case tune_grid()
is used under to hood to make such an initial set of results).
We’ll run the optimization more than once, so let’s make an initial grid of results to serve as the substrate for the BO.
I suggest at least the number of tuning parameters plus two as the initial grid for BO.
What about non-numeric tuning parameters such as activation
?
Currently, tidymodels converts these to dummy indicators and uses those in the GP. This is not unusual but also not great.
An upcoming version of tune will use a different R package to fit the GP that uses factor or Gower kernels. This will avoid making indicators and require fewer initial points.
set.seed(12)
init_res <-
nnet_wflow |>
tune_grid(
resamples = sim_rs,
grid = nrow(nnet_param) + 6, # for activation values + 1 extra
param_info = nnet_param,
metrics = cls_mtr
)
show_best(init_res, metric = "brier_class", n = 3) |> select(-.metric, -.estimator)
#> # A tibble: 3 × 9
#> hidden_units penalty activation learn_rate threshold mean n std_err
#> <int> <dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 30 0.001 log_sigmo… 0.331 0.0101 0.0410 10 0.00235
#> 2 6 0.0001 elu 0.0251 0.0800 0.0434 10 0.00173
#> 3 40 0.0000000001 elu 0.0132 0.0600 0.0453 10 0.00358
#> # ℹ 1 more variable: .config <chr>
ctrl_bo <- control_bayes(verbose_iter = TRUE, no_improve = Inf)
set.seed(125)
nnet_bayes_res <-
nnet_wflow |>
tune_bayes(
resamples = sim_rs,
initial = init_res, # <- initial results
iter = 25,
control = ctrl_bo,
param_info = nnet_param,
metrics = cls_mtr
)
#> Optimizing brier_class using the expected improvement
#>
#> ── Iteration 1 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=3, penalty=2.36e-06, activation=log_sigmoid, learn_rate=0.616,
#> threshold=0.09
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.06833 (+/-0.00667)
#>
#> ── Iteration 2 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=38, penalty=3.22e-05, activation=log_sigmoid,
#> learn_rate=0.578, threshold=0.0678
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04191 (+/-0.0022)
#>
#> ── Iteration 3 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=23, penalty=1.47e-08, activation=elu, learn_rate=0.348,
#> threshold=0.0657
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04808 (+/-0.00476)
#>
#> ── Iteration 4 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=19, penalty=0.125, activation=elu, learn_rate=0.00122,
#> threshold=0.0282
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.05124 (+/-0.00418)
#>
#> ── Iteration 5 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=25, penalty=1.15e-10, activation=tanh, learn_rate=0.587,
#> threshold=0.00897
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.05473 (+/-0.00369)
#>
#> ── Iteration 6 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=42, penalty=1.64e-05, activation=log_sigmoid,
#> learn_rate=0.134, threshold=0.0313
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04167 (+/-0.00294)
#>
#> ── Iteration 7 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=2, penalty=1.44e-05, activation=elu, learn_rate=0.592,
#> threshold=0.0935
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.05835 (+/-0.00898)
#>
#> ── Iteration 8 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=23, penalty=0.34, activation=elu, learn_rate=0.0307,
#> threshold=0.0422
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04357 (+/-0.00432)
#>
#> ── Iteration 9 ───────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=50, penalty=3.11e-05, activation=elu, learn_rate=0.443,
#> threshold=0.0791
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04494 (+/-0.00374)
#>
#> ── Iteration 10 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=35, penalty=2.68e-07, activation=log_sigmoid, learn_rate=0.3,
#> threshold=0.0283
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04413 (+/-0.00326)
#>
#> ── Iteration 11 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=48, penalty=0.912, activation=log_sigmoid, learn_rate=0.158,
#> threshold=0.000413
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.0424 (+/-0.00356)
#>
#> ── Iteration 12 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=36, penalty=0.474, activation=log_sigmoid, learn_rate=0.32,
#> threshold=0.0632
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04258 (+/-0.00239)
#>
#> ── Iteration 13 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=43, penalty=0.000343, activation=log_sigmoid,
#> learn_rate=0.419, threshold=0.00878
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04314 (+/-0.00215)
#>
#> ── Iteration 14 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=18, penalty=1.69e-09, activation=elu, learn_rate=0.0127,
#> threshold=0.043
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04498 (+/-0.00399)
#>
#> ── Iteration 15 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=27, penalty=8.91e-05, activation=log_sigmoid, learn_rate=0.63,
#> threshold=0.063
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04534 (+/-0.0035)
#>
#> ── Iteration 16 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=49, penalty=1.11e-05, activation=log_sigmoid,
#> learn_rate=0.0964, threshold=0.097
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04303 (+/-0.00264)
#>
#> ── Iteration 17 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=27, penalty=0.000519, activation=elu, learn_rate=0.0288,
#> threshold=0.0952
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04144 (+/-0.00259)
#>
#> ── Iteration 18 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=40, penalty=0.00206, activation=log_sigmoid, learn_rate=0.159,
#> threshold=0.01
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04244 (+/-0.00267)
#>
#> ── Iteration 19 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=48, penalty=9.79e-10, activation=elu, learn_rate=0.0306,
#> threshold=0.0726
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04707 (+/-0.00327)
#>
#> ── Iteration 20 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=46, penalty=4.16e-05, activation=log_sigmoid,
#> learn_rate=0.00323, threshold=0.0303
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.1023 (+/-0.0122)
#>
#> ── Iteration 21 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=27, penalty=0.0042, activation=elu, learn_rate=0.0214,
#> threshold=0.0673
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04396 (+/-0.00307)
#>
#> ── Iteration 22 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=20, penalty=0.101, activation=log_sigmoid, learn_rate=0.0933,
#> threshold=0.0109
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.05079 (+/-0.00509)
#>
#> ── Iteration 23 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=45, penalty=1.9e-10, activation=elu, learn_rate=0.153,
#> threshold=0.0949
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04858 (+/-0.0047)
#>
#> ── Iteration 24 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=46, penalty=4e-10, activation=elu, learn_rate=0.00275,
#> threshold=0.0375
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.04534 (+/-0.0042)
#>
#> ── Iteration 25 ──────────────────────────────────────────────────────
#>
#> i Current best: brier_class=0.04097 (@iter 0)
#> i Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i hidden_units=21, penalty=7.3e-10, activation=elu, learn_rate=0.0281,
#> threshold=0.0805
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: brier_class=0.0431 (+/-0.00355)
show_best(nnet_bayes_res, metric = "brier_class") |> select(-.metric, -.estimator)
#> # A tibble: 5 × 10
#> hidden_units penalty activation learn_rate threshold mean n std_err
#> <int> <dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 30 0.001 log_sigmoid 0.331 0.0101 0.0410 10 0.00235
#> 2 27 0.000519 elu 0.0288 0.0952 0.0414 10 0.00259
#> 3 42 0.0000164 log_sigmoid 0.134 0.0313 0.0417 10 0.00294
#> 4 38 0.0000322 log_sigmoid 0.578 0.0678 0.0419 10 0.00220
#> 5 48 0.912 log_sigmoid 0.158 0.000413 0.0424 10 0.00356
#> # ℹ 2 more variables: .config <chr>, .iter <int>
Let’s try a different acquisition function: conf_bound(kappa)
.
We’ll use the objective
argument to set it.
Choose your own kappa
value:
Bonus points: Before the optimization is done, press <esc>
and see what happens.
10:00
Stopping tune_bayes()
will return the current results.
Parallel processing can still be used to more efficiently measure each candidate point.
There are a lot of other iterative methods that you can use.
The finetune package also has functions for simulated annealing search.