6 - Tuning Hyperparameters

Machine learning with tidymodels

Tuning parameters

Some model or preprocessing parameters cannot be estimated directly from the data.

Some examples:

  • Tree depth in decision trees
  • Number of neighbors in a K-nearest neighbor model
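
For example, such parameters are flagged with tune() in a parsnip model specification. A minimal sketch (these specs are illustrative and not part of this section's workflow):

library(tidymodels)

# Mark tree depth and the number of neighbors as tunable
# rather than fixing them to a single value
tree_spec <- decision_tree(tree_depth = tune()) %>% set_mode("classification")
knn_spec  <- nearest_neighbor(neighbors = tune()) %>% set_mode("classification")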

Activation function in neural networks?

Sigmoidal functions, ReLU, etc.

Yes, it is a tuning parameter. ✅

Number of PCA components to retain?

Yes, it is a tuning parameter. ✅
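
In a recipe, this can be marked with tune() as well. A sketch, assuming some recipe rec with numeric predictors:

rec %>% 
  step_pca(all_numeric_predictors(), num_comp = tune())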

Bayesian priors for model parameters?

Hmmmm, probably not. These are based on prior belief. ❌

Covariance/correlation matrix structure in mixed models?

Yes, but it is unlikely to affect performance.

It will impact inference though. 🤔

Is the random seed a tuning parameter?

Nope. It is not. ❌

Optimize tuning parameters

  • Try different values and measure their performance.
  • Find good values for these parameters.
  • Once the parameter values are determined, the model can be finalized by fitting it to the entire training set.

Optimize tuning parameters

The two main strategies for optimization are:

  • Grid search 💠 which tests a pre-defined set of candidate values

  • Iterative search 🌀 which suggests/estimates new values of candidate parameters to evaluate
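
Grid search is what this section uses. For completeness, here is a minimal sketch of iterative search via Bayesian optimization with tune_bayes(), assuming the tunable workflow and validation resamples created later in this section:

set.seed(9)
glm_bayes_res <-
  glm_spline_wflow %>%
  tune_bayes(resamples = nhl_val, initial = 5, iter = 15)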

Choosing tuning parameters

Let’s take our previous recipe and add a few changes:

glm_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_lencode_mixed(shooter, goaltender, outcome = vars(on_goal)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_mutate(
    angle = abs( atan2(abs(coord_y), (89 - coord_x) ) * (180 / pi) ),
    defensive_zone = ifelse(coord_x <= -25.5, 1, 0),
    behind_goal_line = ifelse(coord_x >= 89, 1, 0)
  ) %>%
  step_zv(all_predictors()) %>%
  step_ns(angle, deg_free = tune("angle")) %>%
  step_ns(coord_x, deg_free = tune("coord_x")) %>%
  step_normalize(all_numeric_predictors())

Let’s tune() our spline terms!

Choosing tuning parameters

Now let’s bundle the tunable recipe and a model into a workflow:

glm_spline_wflow <-
  workflow() %>%
  add_model(logistic_reg()) %>%
  add_recipe(glm_rec)

Splines and nonlinear relationships

Create a grid

glm_spline_wflow %>% 
  extract_parameter_set_dials()
#> Collection of 2 parameters for tuning
#> 
#>  identifier     type    object
#>       angle deg_free nparam[+]
#>     coord_x deg_free nparam[+]

A parameter set can be updated (e.g. to change the ranges).

Create a grid

set.seed(12)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_latin_hypercube(size = 25)

grid
#> # A tibble: 25 × 2
#>    angle coord_x
#>    <int>   <int>
#>  1     5       8
#>  2     1      13
#>  3     8       6
#>  4    11       5
#>  5     4       1
#>  6     6      10
#>  7    10      15
#>  8    12      13
#>  9     8      13
#> 10    10       6
#> # ℹ 15 more rows

  • A space-filling design like this tends to perform better than random grids.
  • Space-filling designs are also usually more efficient than regular grids.

Your turn

Create a grid for our tunable workflow.

Try creating a regular grid.

03:00

Create a grid

set.seed(12)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_regular(levels = 25)

grid
#> # A tibble: 225 × 2
#>    angle coord_x
#>    <int>   <int>
#>  1     1       1
#>  2     2       1
#>  3     3       1
#>  4     4       1
#>  5     5       1
#>  6     6       1
#>  7     7       1
#>  8     8       1
#>  9     9       1
#> 10    10       1
#> # ℹ 215 more rows

Note that we get 225 candidates rather than 25 × 25 = 625: each deg_free parameter here is an integer with only 15 possible values, so grid_regular() caps out at 15 levels per parameter, giving 15 × 15 = 225 combinations.

Update parameter ranges

set.seed(12)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  update(angle = spline_degree(c(2L, 50L)),
         coord_x = spline_degree(c(2L, 50L))) %>% 
  grid_latin_hypercube(size = 25)

grid
#> # A tibble: 25 × 2
#>    angle coord_x
#>    <int>   <int>
#>  1    14      27
#>  2     4      42
#>  3    26      20
#>  4    36      16
#>  5    13       3
#>  6    20      33
#>  7    31      49
#>  8    40      44
#>  9    24      45
#> 10    34      18
#> # ℹ 15 more rows

The results

grid %>% 
  ggplot(aes(angle, coord_x)) +
  geom_point(size = 4)

Use the tune_*() functions to tune models

Your turn

Tune our glm_spline_wflow.

What happens if you don’t supply a grid argument to tune_grid()?

05:00
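
For reference, here is a sketch of the call that produces the glm_spline_res object shown below. The ctrl object is assumed to be created earlier with control_grid() (e.g. control_grid(save_pred = TRUE)). If you don’t supply grid, tune_grid() generates a space-filling design with 10 candidate values by default.

set.seed(9)
glm_spline_res <-
  glm_spline_wflow %>%
  tune_grid(resamples = nhl_val, grid = grid, control = ctrl)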

Grid results

autoplot(glm_spline_res)

Tuning results

collect_metrics(glm_spline_res)
#> # A tibble: 50 × 8
#>    angle coord_x .metric  .estimator  mean     n std_err .config              
#>    <int>   <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
#>  1    14      27 accuracy binary     0.804     1      NA Preprocessor01_Model1
#>  2    14      27 roc_auc  binary     0.815     1      NA Preprocessor01_Model1
#>  3     4      42 accuracy binary     0.808     1      NA Preprocessor02_Model1
#>  4     4      42 roc_auc  binary     0.820     1      NA Preprocessor02_Model1
#>  5    26      20 accuracy binary     0.805     1      NA Preprocessor03_Model1
#>  6    26      20 roc_auc  binary     0.819     1      NA Preprocessor03_Model1
#>  7    36      16 accuracy binary     0.800     1      NA Preprocessor04_Model1
#>  8    36      16 roc_auc  binary     0.817     1      NA Preprocessor04_Model1
#>  9    13       3 accuracy binary     0.803     1      NA Preprocessor05_Model1
#> 10    13       3 roc_auc  binary     0.807     1      NA Preprocessor05_Model1
#> # ℹ 40 more rows

Tuning results

collect_metrics(glm_spline_res, summarize = FALSE)
#> # A tibble: 50 × 7
#>    id         angle coord_x .metric  .estimator .estimate .config              
#>    <chr>      <int>   <int> <chr>    <chr>          <dbl> <chr>                
#>  1 validation    14      27 accuracy binary         0.804 Preprocessor01_Model1
#>  2 validation    14      27 roc_auc  binary         0.815 Preprocessor01_Model1
#>  3 validation     4      42 accuracy binary         0.808 Preprocessor02_Model1
#>  4 validation     4      42 roc_auc  binary         0.820 Preprocessor02_Model1
#>  5 validation    26      20 accuracy binary         0.805 Preprocessor03_Model1
#>  6 validation    26      20 roc_auc  binary         0.819 Preprocessor03_Model1
#>  7 validation    36      16 accuracy binary         0.800 Preprocessor04_Model1
#>  8 validation    36      16 roc_auc  binary         0.817 Preprocessor04_Model1
#>  9 validation    13       3 accuracy binary         0.803 Preprocessor05_Model1
#> 10 validation    13       3 roc_auc  binary         0.807 Preprocessor05_Model1
#> # ℹ 40 more rows

Choose a parameter combination

show_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 5 × 8
#>   angle coord_x .metric .estimator  mean     n std_err .config              
#>   <int>   <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1     4      42 roc_auc binary     0.820     1      NA Preprocessor02_Model1
#> 2    26      20 roc_auc binary     0.819     1      NA Preprocessor03_Model1
#> 3    36      16 roc_auc binary     0.817     1      NA Preprocessor04_Model1
#> 4    40      44 roc_auc binary     0.817     1      NA Preprocessor08_Model1
#> 5    37      27 roc_auc binary     0.816     1      NA Preprocessor15_Model1

Choose a parameter combination

Create your own tibble for final parameters or use one of the tune::select_*() functions:

select_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 1 × 3
#>   angle coord_x .config              
#>   <int>   <int> <chr>                
#> 1     4      42 Preprocessor02_Model1

This best result has:

  • low degrees of freedom for the angle spline (less “wiggly”, less complex)
  • higher degrees of freedom for the coord_x spline (more “wiggly”, more complex)
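
Alternatively, a hand-built tibble with the same columns works anywhere the select_*() output does (a sketch using the values selected above):

best_params <- tibble(angle = 4L, coord_x = 42L)
# can be passed to finalize_workflow() just like select_best() results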

Boosted trees 🌳🌲🌴🌵🌴🌳🌳🌴🌲🌵🌴🌲🌳🌴🌳🌵🌵🌴🌲🌲🌳🌴🌳🌴🌲🌴🌵🌴🌲🌴🌵🌲🌵🌴🌲🌳🌴🌵🌳🌴🌳

Boosted trees 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

  • Ensemble many decision tree models

Review how a decision tree model works:

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some condition is met (maximum depth, no more data)

  • Then the tree is pruned to reduce its complexity

Single decision tree
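
A minimal sketch of fitting one tree to these data, using parsnip's default rpart engine (for illustration only; not part of the tuned workflows below):

single_tree_fit <-
  decision_tree(mode = "classification") %>%
  fit(on_goal ~ ., data = nhl_train)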

Boosted trees 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

Boosting methods fit a sequence of tree-based models.

  • Each tree is dependent on the one before and tries to compensate for any poor results in the previous trees.

  • This is like gradient-based steepest descent methods from calculus.

Boosted tree tuning parameters

Most modern boosting methods have a lot of tuning parameters!

  • For tree growth and pruning (min_n, tree_depth, etc.)

  • For boosting (trees, stop_iter, learn_rate)

We’ll use early stopping to stop boosting when a few iterations produce consecutively worse results.
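
A sketch of what early stopping can look like in a parsnip specification; the xgboost engine's validation argument holds out a fraction of the training data to monitor performance (the values here are illustrative):

xgb_stop_spec <-
  boost_tree(trees = 2000, stop_iter = 10, learn_rate = 0.01) %>%
  set_mode("classification") %>%
  set_engine("xgboost", validation = 0.1)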

Comparing tree ensembles

Random forest

  • Independent trees
  • Bootstrapped data
  • No pruning
  • 1000’s of trees

Boosting

  • Dependent trees
  • Different case weights
  • Tune tree parameters
  • Far fewer trees

In terms of performance, the general consensus for tree-based models is: boosting > random forest > bagging > single trees.

Boosted tree code

xgb_spec <-
  boost_tree(
    trees = tune(), min_n = tune(), tree_depth = tune(),
    learn_rate = tune(), loss_reduction = tune()
  ) %>%
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgb_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_lencode_mixed(shooter, goaltender, outcome = vars(on_goal)) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

xgb_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(xgb_rec)

Your turn

Create your boosted tree workflow.

03:00

Running in parallel

  • Grid search, combined with resampling, requires fitting a lot of models!

  • These models don’t depend on one another and can be run in parallel.

We can use a parallel backend to do this:

cores <- parallelly::availableCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call `tune_grid()`!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)

Running in parallel

Speed-ups are fairly linear up to the number of physical cores (10 here).

Tuning

This will take some time to run ⏳

set.seed(9)

xgb_res <-
  xgb_wflow %>%
  tune_grid(resamples = nhl_val, grid = 30, control = ctrl) # automatic grid now!

Your turn

Start tuning the boosted tree model!

We won’t wait for everyone’s tuning to finish, but take this time to get it started before we move on.

03:00

Tuning results

xgb_res
#> # Tuning results
#> # Validation Set Split (0.8/0.2)  
#> # A tibble: 1 × 5
#>   splits              id         .metrics          .notes           .predictions          
#>   <list>              <chr>      <list>            <list>           <list>                
#> 1 <split [5348/1338]> validation <tibble [60 × 9]> <tibble [0 × 3]> <tibble [40,140 × 11]>

Tuning results

autoplot(xgb_res)

Again with the location features

coord_rec <- 
  xgb_rec %>%
  step_mutate(
    angle = abs( atan2(abs(coord_y), (89 - coord_x) ) * (180 / pi) ),
    defensive_zone = ifelse(coord_x <= -25.5, 1, 0),
    behind_goal_line = ifelse(coord_x >= 89, 1, 0)
  )

xgb_coord_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(coord_rec)

set.seed(9)
xgb_coord_res <-
  xgb_coord_wflow %>%
  tune_grid(resamples = nhl_val, grid = 30, control = ctrl)

Did the machine figure it out?

No extra features:

show_best(xgb_res, metric = "roc_auc", n = 3)
#> # A tibble: 3 × 11
#>   trees min_n tree_depth learn_rate loss_reduction .metric .estimator  mean     n std_err .config              
#>   <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1  1697     3          3    0.0116       0.674     roc_auc binary     0.804     1      NA Preprocessor1_Model02
#> 2  1264    16          2    0.00732      0.000340  roc_auc binary     0.803     1      NA Preprocessor1_Model12
#> 3   569    39          4    0.0272       0.0000288 roc_auc binary     0.800     1      NA Preprocessor1_Model30

With additional coordinate features:

show_best(xgb_coord_res, metric = "roc_auc", n = 3)
#> # A tibble: 3 × 11
#>   trees min_n tree_depth learn_rate loss_reduction .metric .estimator  mean     n std_err .config              
#>   <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1  1697     3          3    0.0116    0.674        roc_auc binary     0.807     1      NA Preprocessor1_Model02
#> 2  1264    16          2    0.00732   0.000340     roc_auc binary     0.804     1      NA Preprocessor1_Model12
#> 3   427    31          2    0.249     0.0000000773 roc_auc binary     0.803     1      NA Preprocessor1_Model23

Compare models

Best logistic regression results:

glm_spline_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config              
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 roc_auc binary     0.820     1      NA Preprocessor02_Model1

Best boosting results:

xgb_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config              
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 roc_auc binary     0.804     1      NA Preprocessor1_Model02

Updating the workflow

best_auc <- select_best(glm_spline_res, metric = "roc_auc")
best_auc
#> # A tibble: 1 × 3
#>   angle coord_x .config              
#>   <int>   <int> <chr>                
#> 1     4      42 Preprocessor02_Model1

glm_spline_wflow <-
  glm_spline_wflow %>% 
  finalize_workflow(best_auc)

glm_spline_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 7 Recipe Steps
#> 
#> • step_lencode_mixed()
#> • step_dummy()
#> • step_mutate()
#> • step_zv()
#> • step_ns()
#> • step_ns()
#> • step_normalize()
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

The final fit to the NHL data

test_res <- 
  glm_spline_wflow %>% 
  last_fit(split = nhl_split)

test_res
#> # Resampling results
#> # Manual resampling 
#> # A tibble: 1 × 6
#>   splits              id               .metrics         .notes           .predictions         .workflow 
#>   <list>              <chr>            <list>           <list>           <list>               <list>    
#> 1 <split [6686/2229]> train/test split <tibble [2 × 4]> <tibble [0 × 3]> <tibble [2,229 × 6]> <workflow>

Remember that last_fit() fits one time with the combined training and validation set, then evaluates one time with the testing set.

Your turn

Finalize your workflow with the best parameters.

Create a final fit.

08:00

Estimates of ROC AUC

Validation results from tuning:

glm_spline_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, mean, n, std_err)
#> # A tibble: 1 × 4
#>   .metric  mean     n std_err
#>   <chr>   <dbl> <int>   <dbl>
#> 1 roc_auc 0.820     1      NA

Test set results:

test_res %>% collect_metrics()
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.800 Preprocessor1_Model1
#> 2 roc_auc  binary         0.807 Preprocessor1_Model1

Final fitted workflow

Extract the final fitted workflow, fit with the combined training and validation data:

final_glm_spline_wflow <- 
  test_res %>% 
  extract_workflow()

# use this object to predict or deploy
predict(final_glm_spline_wflow, nhl_test[1:3,])
#> # A tibble: 3 × 1
#>   .pred_class
#>   <fct>      
#> 1 yes        
#> 2 yes        
#> 3 no
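
Class probabilities are available the same way (a quick usage sketch):

predict(final_glm_spline_wflow, nhl_test[1:3,], type = "prob")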

Next steps

  • Use explainers to characterize the model and the predictions.

Explain yourself

There are two categories of model explanations, global and local.

  • Global model explanations provide an overall understanding aggregated over a whole set of observations.

  • Local model explanations provide information about a prediction for a single observation.

You can also build global model explanations by aggregating local model explanations.
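
For example, DALEX can break down a single prediction into per-feature contributions (a sketch of a local explanation, assuming the glm_explainer created below):

# local break-down explanation for one observation
one_shot  <- nhl_test[1, ]
breakdown <- DALEX::predict_parts(glm_explainer, new_observation = one_shot)
plot(breakdown)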

tidymodels integrates with model explainability frameworks

A tidymodels explainer

We can build explainers using:

  • original, basic predictors
  • derived features

library(DALEXtra)

glm_explainer <- explain_tidymodels(
  final_glm_spline_wflow,
  data = dplyr::select(nhl_train, -on_goal),
  # DALEX requires an integer for factors:
  y = as.integer(nhl_train$on_goal) - 1,
  verbose = FALSE
)

Explain the x coordinates

With our explainer, let’s create partial dependence profiles:

set.seed(123)
pdp_coord_x <- model_profile(
  glm_explainer,
  variables = "coord_x",
  N = 500,
  groups = "strength"
)

You can use the default plot() method or create your own visualization.
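
For example, the default method:

plot(pdp_coord_x)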

Your turn

Create an explainer for our glm model.

Try using other variables, like extra_attacker or game_seconds.

05:00