Getting More Out of Feature Engineering and Tuning for Machine Learning
library(tidymodels)

# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |>
paste0("workshops/main/slides/class_data.RData") |>
url() |>
load()
set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test <- testing(sim_split)
set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)
Previously, we evaluated 250 models (25 candidates times 10 resamples).
We can make this go faster using parallel processing.
Also, for some models, we can fit far fewer models than the number being evaluated.
For example, a boosted tree fit with X trees can often produce predictions for candidate models with fewer than X trees (i.e., no retraining). These strategies can lead to enormous speed-ups.
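As a hedged illustration of this submodel trick (an addition, using the xgboost engine rather than anything shown here), a single boosted tree fit with many trees can generate predictions for several smaller values of trees via parsnip's multi_predict():
# Fit once with many trees (assumes sim_train/sim_test from above)
big_fit <-
  boost_tree(trees = 500) |>
  set_mode("classification") |>
  set_engine("xgboost") |>
  fit(class ~ ., data = sim_train)
# Predictions for several candidate values of `trees`, all from one fit:
multi_predict(big_fit, new_data = sim_test, trees = c(50, 100, 250, 500))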
Racing is an old tool that we can use to go even faster.
The core idea: evaluate all candidates on a small number of resamples and discard the ones that are clearly not competitive (tanh activation, cough). This can result in fitting a small number of models.
It is not an iterative search; it is an adaptive grid search.
How do we eliminate tuning parameter combinations?
There are a few methods to do so. We’ll use one based on analysis of variance (ANOVA).
However… there is typically a large resampling effect in the results.
Here are some realistic (but simulated) examples of two candidate models.
An error estimate is measured for each of 10 resamples.
There is usually a significant resample-to-resample effect (rank corr: 0.83).
One way to evaluate these models is to do a paired t-test.
With \(n = 10\) resamples, the confidence interval for the difference in the model error is (0.99, 2.8), indicating that candidate number 2 has a smaller error.
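As a minimal sketch of that paired analysis, assuming made-up per-resample error values (these numbers are hypothetical and do not reproduce the interval above):
# Hypothetical error estimates for two candidates across 10 resamples
errors_1 <- c(11.2, 12.5, 9.8, 13.1, 10.7, 12.0, 9.3, 11.6, 12.8, 10.1)
errors_2 <- c( 9.6, 10.9, 8.4, 11.8,  9.0, 10.3, 8.1, 10.2, 11.3,  8.8)
# The resample-to-resample effect appears as a high rank correlation
cor(errors_1, errors_2, method = "spearman")
# A paired t-test on the within-resample differences gives a confidence
# interval for the difference in error between the candidates
t.test(errors_1 - errors_2)$conf.int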
What if we were to have compared the candidates while we sequentially evaluated each resample?
One candidate shows superiority when 5 resamples have been evaluated.
One version of racing uses a mixed model ANOVA to construct one-sided confidence intervals for each candidate versus the current best.
Any candidates whose bound does not include zero are discarded.
The resamples are analyzed in a random order (so set the seed).
Kuhn (2014) has examples and simulations to show that the method works.
The finetune package has functions tune_race_anova() and tune_race_win_loss().
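As a hedged sketch, a few control_race() options govern the elimination process (the values shown are assumptions about the package defaults, not settings from this section):
library(finetune)
# burn_in: minimum number of resamples scored before any eliminations
# alpha: significance level for the one-sided confidence bounds
# randomize: analyze the resamples in a random order (hence the seed)
control_race(burn_in = 3, alpha = 0.05, randomize = TRUE)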
Boosted trees are popular ensemble methods that build a sequence of tree models.
Each tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted.
Each tree in the ensemble is saved, and new samples are predicted using a weighted average of each tree's votes.
We’ll focus on the popular lightgbm implementation.
Some possible parameters:
mtry: The number of predictors randomly sampled at each split (in \([1, ncol(x)]\) or \((0, 1]\)).
trees: The number of trees (\([1, \infty]\), but usually up to thousands).
min_n: The minimum number of samples needed to split a node further (\([1, n]\)).
learn_rate: The rate at which each tree adapts from previous iterations (\((0, \infty]\), usual maximum is 0.1).
stop_iter: The number of boosting iterations without improvement before stopping early (\([1, trees]\)).
TBH, it is usually not difficult to optimize these models.
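For reference, here is a hedged sketch of the corresponding dials parameter objects and their default ranges (an addition, not part of the original code):
library(dials)
trees()       # number of trees
min_n()       # minimal node size
learn_rate()  # on a log10 scale by default
stop_iter()   # early-stopping iterations
mtry()        # upper bound is unknown until finalized with the predictors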
Often, there are multiple candidate tuning parameter regions with very good results.
To demonstrate, we’ll look at optimizing five of the tuning parameters.
We’ll need to load the bonsai package; it has the information needed to use the lightgbm engine.
library(bonsai)
lgbm_spec <-
  boost_tree(
    trees = tune(),
    learn_rate = tune(),
    mtry = tune(),
    min_n = tune(),
    stop_iter = tune()
  ) |>
  set_mode("classification") |>
  # Turn off within-tree parallel processing; it's faster to run
  # the resamples/configurations in parallel
  set_engine("lightgbm", num_threads = 1)

# No preprocessing required:
lgbm_wflow <- workflow(class ~ ., lgbm_spec)
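Before tuning, it can help to peek at the parameters the workflow will optimize. This sketch is an assumption (not shown in the source) that finalizes the unknown upper range of mtry with the training predictors:
lgbm_param <-
  lgbm_wflow |>
  extract_parameter_set_dials() |>
  finalize(sim_train |> dplyr::select(-class))
# `grid = 50` below generates a space-filling grid of that size from
# these parameter ranges automatically
lgbm_param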
library(finetune)
# Set verbose_elim = TRUE to print candidate eliminations during the demo
ctrl <- control_race(verbose_elim = FALSE)
# Optimizes on the first metric in the set
cls_mtr <- metric_set(brier_class, roc_auc, sensitivity, specificity)
mirai::daemons(parallel::detectCores() - 1)
set.seed(321)
lgbm_res <-
  lgbm_wflow |>
  tune_race_anova( # <- very similar syntax to tune_grid()
    resamples = sim_rs,
    # Let's use a larger grid
    grid = 50,
    control = ctrl,
    metrics = cls_mtr
  )
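A hedged sketch of inspecting the racing results afterward (not part of the timed comparison below):
show_best(lgbm_res, metric = "brier_class")  # best surviving candidates
plot_race(lgbm_res)                          # when candidates were eliminated
collect_metrics(lgbm_res)                    # metrics for candidates that finished the race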
Times using 10 cores: sequential: 605s, parallel: 92s, and parallel racing: 50s.
Parallel was 6.6-fold faster, and racing in parallel was 12.3-fold faster.
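When the parallel work is finished, the workers can be shut down; this cleanup step is an assumption, not shown in the source:
mirai::daemons(0)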
Try running tune_race_anova() with a different seed and/or a different metric.