Getting More Out of Feature Engineering and Tuning for Machine Learning
library(tidymodels)

# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |>
paste0("workshops/main/slides/class_data.RData") |>
url() |>
load()
set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test <- testing(sim_split)
set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)
Previously, we evaluated 250 models (25 candidates times 10 resamples).
We can make this go faster using parallel processing.
Also, for some models, we can fit far fewer models than the number being evaluated.
For example, a boosted tree fit with X trees can often produce predictions for candidate models with fewer than X trees (i.e., no retraining). These strategies can lead to enormous speed-ups.
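As a hedged illustration of this submodel trick (an addition, using the xgboost engine rather than anything shown here), a single boosted tree fit with many trees can generate predictions for several smaller values of trees via parsnip's multi_predict():
# Fit once with many trees (assumes sim_train/sim_test from above)
big_fit <-
  boost_tree(trees = 500) |>
  set_mode("classification") |>
  set_engine("xgboost") |>
  fit(class ~ ., data = sim_train)
# Predictions for several candidate values of `trees`, all from one fit:
multi_predict(big_fit, new_data = sim_test, trees = c(50, 100, 250, 500))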
Racing is an old tool that we can use to go even faster.
The core idea: evaluate all candidates on a small number of resamples and discard the ones that are clearly not competitive (tanh activation, cough). This can result in fitting a small number of models.
It is not an iterative search; it is an adaptive grid search.
How do we eliminate tuning parameter combinations?
There are a few methods to do so. We’ll use one based on analysis of variance (ANOVA).
However… there is typically a large resampling effect in the results.
Here are some realistic (but simulated) examples of two candidate models.
An error estimate is measured for each of 10 resamples.
There is usually a significant resample-to-resample effect (rank corr: 0.83).
One way to evaluate these models is to do a paired t-test.
With \(n = 10\) resamples, the confidence interval for the difference in the model error is (0.99, 2.8), indicating that candidate number 2 has a smaller error.
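As a minimal sketch of that paired analysis, assuming made-up per-resample error values (these numbers are hypothetical and do not reproduce the interval above):
# Hypothetical error estimates for two candidates across 10 resamples
errors_1 <- c(11.2, 12.5, 9.8, 13.1, 10.7, 12.0, 9.3, 11.6, 12.8, 10.1)
errors_2 <- c( 9.6, 10.9, 8.4, 11.8,  9.0, 10.3, 8.1, 10.2, 11.3,  8.8)
# The resample-to-resample effect appears as a high rank correlation
cor(errors_1, errors_2, method = "spearman")
# A paired t-test on the within-resample differences gives a confidence
# interval for the difference in error between the candidates
t.test(errors_1 - errors_2)$conf.int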
What if we were to have compared the candidates while we sequentially evaluated each resample?
One candidate shows superiority when 5 resamples have been evaluated.
One version of racing uses a mixed model ANOVA to construct one-sided confidence intervals for each candidate versus the current best.
Any candidates whose bound does not include zero are discarded.
The resamples are analyzed in a random order (so set the seed).
Kuhn (2014) has examples and simulations to show that the method works.
The finetune package has functions tune_race_anova() and tune_race_win_loss().
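As a hedged sketch, a few control_race() options govern the elimination process (the values shown are assumptions about the package defaults, not settings from this section):
library(finetune)
# burn_in: minimum number of resamples scored before any eliminations
# alpha: significance level for the one-sided confidence bounds
# randomize: analyze the resamples in a random order (hence the seed)
control_race(burn_in = 3, alpha = 0.05, randomize = TRUE)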
Boosted trees are popular ensemble methods that build a sequence of tree models.
Each tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted.
Each tree in the ensemble is saved, and new samples are predicted using a weighted average of each tree's votes.
We’ll focus on the popular lightgbm implementation.
Some possible parameters:
mtry: The number of predictors randomly sampled at each split (in \([1, ncol(x)]\) or \((0, 1]\)).
trees: The number of trees (\([1, \infty]\), but usually up to thousands).
min_n: The minimum number of samples needed to split a node further (\([1, n]\)).
learn_rate: The rate at which each tree adapts from previous iterations (\((0, \infty]\), usual maximum is 0.1).
stop_iter: The number of boosting iterations without improvement before stopping early (\([1, trees]\)).
TBH, it is usually not difficult to optimize these models.
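For reference, here is a hedged sketch of the corresponding dials parameter objects and their default ranges (an addition, not part of the original code):
library(dials)
trees()       # number of trees
min_n()       # minimal node size
learn_rate()  # on a log10 scale by default
stop_iter()   # early-stopping iterations
mtry()        # upper bound is unknown until finalized with the predictors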
Often, there are multiple candidate tuning parameter regions with very good results.
To demonstrate, we’ll look at optimizing five of the tuning parameters.
We’ll need to load the bonsai package; it has the information needed to use the lightgbm engine.
library(bonsai)
lgbm_spec <-
  boost_tree(
    trees = tune(),
    learn_rate = tune(),
    mtry = tune(),
    min_n = tune(),
    stop_iter = tune()
  ) |>
  set_mode("classification") |>
  # Turn off within-tree parallel processing; it's faster to run
  # the resamples/configurations in parallel
  set_engine("lightgbm", num_threads = 1)

# No preprocessing required:
lgbm_wflow <- workflow(class ~ ., lgbm_spec)
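Before tuning, it can help to peek at the parameters the workflow will optimize. This sketch is an assumption (not shown in the source) that finalizes the unknown upper range of mtry with the training predictors:
lgbm_param <-
  lgbm_wflow |>
  extract_parameter_set_dials() |>
  finalize(sim_train |> dplyr::select(-class))
# `grid = 50` below generates a space-filling grid of that size from
# these parameter ranges automatically
lgbm_param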
library(finetune)
# Set verbose_elim = TRUE to print candidate eliminations during the demo
ctrl <- control_race(verbose_elim = FALSE)
# Optimizes on the first metric in the set
cls_mtr <- metric_set(brier_class, roc_auc, sensitivity, specificity)
mirai::daemons(parallel::detectCores() - 1)
set.seed(321)
lgbm_res <-
  lgbm_wflow |>
  tune_race_anova( # <- very similar syntax to tune_grid()
    resamples = sim_rs,
    # Let's use a larger grid
    grid = 50,
    control = ctrl,
    metrics = cls_mtr
  )
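A hedged sketch of inspecting the racing results afterward (not part of the timed comparison below):
show_best(lgbm_res, metric = "brier_class")  # best surviving candidates
plot_race(lgbm_res)                          # when candidates were eliminated
collect_metrics(lgbm_res)                    # metrics for candidates that finished the race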
Times using 10 cores: sequential: 605s, parallel: 92s, and parallel racing: 50s.
Parallel was 6.6-fold faster, and racing in parallel was 12.3-fold faster.
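When the parallel work is finished, the workers can be shut down; this cleanup step is an assumption, not shown in the source:
mirai::daemons(0)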
Try running tune_race_anova() with a different seed and/or a different metric.