6 - Tuning Hyperparameters

Machine learning with tidymodels

Tuning parameters

Some model or preprocessing parameters cannot be estimated directly from the data.

Some examples:

  • Tree depth in decision trees
  • Number of neighbors in a K-nearest neighbor model

Activation function in neural networks?

Sigmoidal functions, ReLU, etc.

Yes, it is a tuning parameter. βœ…

Number of PCA components to retain?

Yes, it is a tuning parameter. βœ…

Bayesian priors for model parameters?

Hmmmm, probably not. These are based on prior belief. ❌

Covariance/correlation matrix structure in mixed models?

Yes, but it is unlikely to affect performance.

It will impact inference though. πŸ€”

Is the random seed a tuning parameter?

Nope. It is not. ❌

Optimize tuning parameters

  • Try different values and measure their performance.
  • Find good values for these parameters.
  • Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.

Optimize tuning parameters

The main two strategies for optimization are:

  • Grid search πŸ’  which tests a pre-defined set of candidate values

  • Iterative search πŸŒ€ which suggests/estimates new values of candidate parameters to evaluate

Choosing tuning parameters

Let’s take our previous recipe and add a few changes:

glm_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_lencode_mixed(player, outcome = vars(on_goal)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_mutate(
    angle = abs(atan2(abs(coord_y), (89 - abs(coord_x))) * (180 / pi)),
    distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2),
    distance = log(distance),
    behind_goal_line = ifelse(abs(coord_x) >= 89, 1, 0)
  ) %>%
  step_rm(coord_x, coord_y) %>%
  step_zv(all_predictors()) %>%
  step_ns(angle, deg_free = tune("angle")) %>%
  step_ns(distance, deg_free = tune("distance")) %>%
  step_normalize(all_numeric_predictors())

Let’s tune() our spline terms!

Choosing tuning parameters

Now let’s bundle the tunable recipe with a model in a workflow:

glm_spline_wflow <-
  workflow() %>%
  add_model(logistic_reg()) %>%
  add_recipe(glm_rec)

Splines and nonlinear relationships

Create a grid

glm_spline_wflow %>% 
  extract_parameter_set_dials()
#> Collection of 2 parameters for tuning
#> 
#>  identifier     type    object
#>       angle deg_free nparam[+]
#>    distance deg_free nparam[+]

A parameter set can be updated (e.g. to change the ranges).

Create a grid

set.seed(2)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_latin_hypercube(size = 25)

grid
#> # A tibble: 23 Γ— 2
#>    angle distance
#>    <int>    <int>
#>  1    12        4
#>  2    15        8
#>  3     6       14
#>  4    10        5
#>  5    12       12
#>  6     7        8
#>  7    14        3
#>  8    14       13
#>  9    11       12
#> 10     8       11
#> # … with 13 more rows
#> # β„Ή Use `print(n = ...)` to see more rows

  β€’ A space-filling design like this tends to perform better than random grids.
  β€’ Space-filling designs are also usually more efficient than regular grids.
  β€’ With integer parameters, candidate values get rounded and duplicates dropped, which is why we ended up with 23 rows instead of 25.

Your turn

Create a grid for our tunable workflow.

Try creating a regular grid.

03:00

Create a grid

set.seed(2)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_regular(levels = 25)

grid
#> # A tibble: 225 Γ— 2
#>    angle distance
#>    <int>    <int>
#>  1     1        1
#>  2     2        1
#>  3     3        1
#>  4     4        1
#>  5     5        1
#>  6     6        1
#>  7     7        1
#>  8     8        1
#>  9     9        1
#> 10    10        1
#> # … with 215 more rows
#> # β„Ή Use `print(n = ...)` to see more rows

Note that deg_free is an integer parameter with a range of 1 to 15 here, so there are only 15 distinct levels per parameter; asking for 25 levels still yields 15 Γ— 15 = 225 combinations.

Update parameter ranges

set.seed(2)
grid <- 
  glm_spline_wflow %>% 
  extract_parameter_set_dials() %>% 
  update(angle = spline_degree(c(2L, 20L)),
         distance = spline_degree(c(2L, 20L))) %>% 
  grid_latin_hypercube(size = 25)

grid
#> # A tibble: 24 Γ— 2
#>    angle distance
#>    <int>    <int>
#>  1    16        6
#>  2    20       11
#>  3     8       19
#>  4    14        7
#>  5    16       17
#>  6    10       11
#>  7    19        5
#>  8    18       17
#>  9    15       16
#> 10    11       15
#> # … with 14 more rows
#> # β„Ή Use `print(n = ...)` to see more rows

The results

grid %>% 
  ggplot(aes(angle, distance)) +
  geom_point(size = 4)

Use the tune_*() functions to tune models

Your turn

Tune our glm_spline_wflow.

What happens if you don’t supply a grid argument to tune_grid()?

05:00
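
One possible solution, with assumed control settings (the same ctrl object shows up again when we tune the boosted trees):

ctrl <- control_grid(save_pred = TRUE)   # assumed settings; keep predictions for later plots

set.seed(9)
glm_spline_res <-
  glm_spline_wflow %>%
  tune_grid(resamples = nhl_val, grid = grid, control = ctrl)

If you don’t supply grid, tune_grid() generates a space-filling design with 10 candidate parameter sets by default.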

Grid results

autoplot(glm_spline_res)

Tuning results

collect_metrics(glm_spline_res)
#> # A tibble: 48 Γ— 8
#>    angle distance .metric  .estimator  mean     n std_err .config              
#>    <int>    <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
#>  1    16        6 accuracy binary     0.610     1      NA Preprocessor01_Model1
#>  2    16        6 roc_auc  binary     0.649     1      NA Preprocessor01_Model1
#>  3    20       11 accuracy binary     0.613     1      NA Preprocessor02_Model1
#>  4    20       11 roc_auc  binary     0.649     1      NA Preprocessor02_Model1
#>  5     8       19 accuracy binary     0.619     1      NA Preprocessor03_Model1
#>  6     8       19 roc_auc  binary     0.652     1      NA Preprocessor03_Model1
#>  7    14        7 accuracy binary     0.610     1      NA Preprocessor04_Model1
#>  8    14        7 roc_auc  binary     0.650     1      NA Preprocessor04_Model1
#>  9    16       17 accuracy binary     0.617     1      NA Preprocessor05_Model1
#> 10    16       17 roc_auc  binary     0.649     1      NA Preprocessor05_Model1
#> # … with 38 more rows
#> # β„Ή Use `print(n = ...)` to see more rows

Tuning results

collect_metrics(glm_spline_res, summarize = FALSE)
#> # A tibble: 48 Γ— 7
#>    id         angle distance .metric  .estimator .estimate .config              
#>    <chr>      <int>    <int> <chr>    <chr>          <dbl> <chr>                
#>  1 validation    16        6 accuracy binary         0.610 Preprocessor01_Model1
#>  2 validation    16        6 roc_auc  binary         0.649 Preprocessor01_Model1
#>  3 validation    20       11 accuracy binary         0.613 Preprocessor02_Model1
#>  4 validation    20       11 roc_auc  binary         0.649 Preprocessor02_Model1
#>  5 validation     8       19 accuracy binary         0.619 Preprocessor03_Model1
#>  6 validation     8       19 roc_auc  binary         0.652 Preprocessor03_Model1
#>  7 validation    14        7 accuracy binary         0.610 Preprocessor04_Model1
#>  8 validation    14        7 roc_auc  binary         0.650 Preprocessor04_Model1
#>  9 validation    16       17 accuracy binary         0.617 Preprocessor05_Model1
#> 10 validation    16       17 roc_auc  binary         0.649 Preprocessor05_Model1
#> # … with 38 more rows
#> # β„Ή Use `print(n = ...)` to see more rows

Choose a parameter combination

show_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 5 Γ— 8
#>   angle distance .metric .estimator  mean     n std_err .config              
#>   <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1     5        8 roc_auc binary     0.653     1      NA Preprocessor12_Model1
#> 2     6        9 roc_auc binary     0.653     1      NA Preprocessor17_Model1
#> 3     3       15 roc_auc binary     0.653     1      NA Preprocessor13_Model1
#> 4     3       13 roc_auc binary     0.652     1      NA Preprocessor11_Model1
#> 5     5       19 roc_auc binary     0.652     1      NA Preprocessor24_Model1

Choose a parameter combination

Create your own tibble for final parameters or use one of the tune::select_*() functions:

select_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 1 Γ— 3
#>   angle distance .config              
#>   <int>    <int> <chr>                
#> 1     5        8 Preprocessor12_Model1
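
Equivalently, you can build the tibble by hand; the column names just need to match the tune() identifiers:

tibble(angle = 5, distance = 8)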

This best result has:

  • low-degree spline for angle (less β€œwiggly”, less complex)
  • higher-degree spline for distance (more β€œwiggly”, more complex)

Your turn

Try an alternative selection strategy.

Read the docs for select_by_pct_loss().

Try choosing a model that has a simpler (less β€œwiggly”) relationship for distance.

05:00

Choose a parameter combination

select_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 1 Γ— 3
#>   angle distance .config              
#>   <int>    <int> <chr>                
#> 1     5        8 Preprocessor12_Model1
select_by_pct_loss(glm_spline_res, distance, metric = "roc_auc")
#> # A tibble: 1 Γ— 10
#>   angle distance .metric .estimator  mean     n std_err .config               .best .loss
#>   <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 <dbl> <dbl>
#> 1    13        4 roc_auc binary     0.646     1      NA Preprocessor20_Model1 0.653 0.984

Boosted trees 🌳🌲🌴🌡🌴🌳🌳🌴🌲🌡🌴🌲🌳🌴🌳🌡🌡🌴🌲🌲🌳🌴🌳🌴🌲🌴🌡🌴🌲🌴🌡🌲🌡🌴🌲🌳🌴🌡🌳🌴🌳

Boosted trees 🌳🌲🌴🌡🌳🌳🌴🌲🌡🌴🌳🌡

  • Ensemble many decision tree models

Review how a decision tree model works:

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some condition is met (maximum depth, no more data)

  • Then the tree is pruned to reduce its complexity

Single decision tree

Boosted trees 🌳🌲🌴🌡🌳🌳🌴🌲🌡🌴🌳🌡

Boosting methods fit a sequence of tree-based models.

  • Each tree is dependent on the one before and tries to compensate for any poor results in the previous trees.

  • This is like gradient-based steepest ascent methods from calculus.

Boosted tree tuning parameters

Most modern boosting methods have a lot of tuning parameters!

  • For tree growth and pruning (min_n, max_depth, etc)

  • For boosting (trees, stop_iter, learn_rate)

We’ll use early stopping to stop boosting when a few iterations produce consecutively worse results.

Comparing tree ensembles

Random forest

  • Independent trees
  • Bootstrapped data
  • No pruning
  • 1000’s of trees

Boosting

  • Dependent trees
  • Different case weights
  • Tune tree parameters
  • Far fewer trees

The general consensus for tree-based models is, in terms of performance: boosting > random forest > bagging > single trees.

Boosted tree code

xgb_spec <-
  boost_tree(
    trees = 500, min_n = tune(), stop_iter = tune(), tree_depth = tune(),
    learn_rate = tune(), loss_reduction = tune()
  ) %>%
  set_mode("classification") %>% 
  set_engine("xgboost", validation = 1/10) # <- for better early stopping

xgb_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_lencode_mixed(player, outcome = vars(on_goal)) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

xgb_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(xgb_rec)

Your turn

Create your boosted tree workflow.

03:00

Running in parallel

  • Grid search, combined with resampling, requires fitting a lot of models!

  • These models don’t depend on one another and can be run in parallel.

We can use a parallel backend to do this:

cores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call `tune_grid()`!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)

Running in parallel

Speed-ups are fairly linear up to the number of physical cores (10 here).

Tuning

This will take some time to run ⏳

set.seed(9)

xgb_res <-
  xgb_wflow %>%
  tune_grid(resamples = nhl_val, grid = 15, control = ctrl) # automatic grid now!

Your turn

Start tuning the boosted tree model!

We won’t wait for everyone’s tuning to finish, but take this time to get it started before we move on.

03:00

Tuning results

xgb_res
#> # Tuning results
#> # Validation Set Split (0.8/0.2)  
#> # A tibble: 1 Γ— 5
#>   splits              id         .metrics          .notes           .predictions          
#>   <list>              <chr>      <list>            <list>           <list>                
#> 1 <split [7288/1822]> validation <tibble [30 Γ— 9]> <tibble [0 Γ— 3]> <tibble [27,330 Γ— 11]>

Tuning results

autoplot(xgb_res)

Again with the location features

coord_rec <- 
  xgb_rec %>%
  step_mutate(
    angle = abs(atan2(abs(coord_y), (89 - abs(coord_x))) * (180 / pi)),
    distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2),
    distance = log(distance),
    behind_goal_line = ifelse(abs(coord_x) >= 89, 1, 0)
  ) %>% 
  step_rm(coord_x, coord_y)

xgb_coord_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(coord_rec)

set.seed(9)
xgb_coord_res <-
  xgb_coord_wflow %>%
  tune_grid(resamples = nhl_val, grid = 20, control = ctrl)

Did the machine figure it out?

show_best(xgb_res, metric = "roc_auc")
#> # A tibble: 5 Γ— 11
#>   min_n tree_depth learn_rate loss_reduction stop_iter .metric .estimator  mean     n std_err .config              
#>   <int>      <int>      <dbl>          <dbl>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1    19          2    0.311         1.46e- 4        14 roc_auc binary     0.633     1      NA Preprocessor1_Model12
#> 2    11          3    0.0255        5.61e- 7        19 roc_auc binary     0.628     1      NA Preprocessor1_Model05
#> 3    27          5    0.0120        6.04e- 6        12 roc_auc binary     0.627     1      NA Preprocessor1_Model07
#> 4    25         14    0.0379        3.62e- 5         8 roc_auc binary     0.626     1      NA Preprocessor1_Model13
#> 5    31         11    0.00585       1.02e-10         7 roc_auc binary     0.625     1      NA Preprocessor1_Model08

show_best(xgb_coord_res, metric = "roc_auc")
#> # A tibble: 5 Γ— 11
#>   min_n tree_depth learn_rate loss_reduction stop_iter .metric .estimator  mean     n std_err .config              
#>   <int>      <int>      <dbl>          <dbl>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1    30         13     0.0578  0.00000000141        18 roc_auc binary     0.648     1      NA Preprocessor1_Model07
#> 2    39         12     0.0803  0.000411             10 roc_auc binary     0.643     1      NA Preprocessor1_Model12
#> 3    14          2     0.146   0.00244              19 roc_auc binary     0.642     1      NA Preprocessor1_Model14
#> 4    26         15     0.0365  2.51                 17 roc_auc binary     0.642     1      NA Preprocessor1_Model11
#> 5    35          5     0.101   0.0000000784         13 roc_auc binary     0.641     1      NA Preprocessor1_Model17

Compare models

Best logistic regression results:

glm_spline_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 Γ— 6
#>   .metric .estimator  mean     n std_err .config              
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 roc_auc binary     0.653     1      NA Preprocessor12_Model1

Best boosting results:

xgb_coord_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 Γ— 6
#>   .metric .estimator  mean     n std_err .config              
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 roc_auc binary     0.648     1      NA Preprocessor1_Model07

Your turn

Can you get better ROC results with xgboost?

Try increasing learn_rate beyond the original range.

20:00
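
One way to start (a sketch): update the learn_rate range and pass the parameter set to tune_grid() via its param_info argument. dials::learn_rate() works in log10 units, so c(-1, 0) means 0.1 to 1; the name xgb_hot_res is just illustrative.

xgb_params <-
  xgb_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = learn_rate(c(-1, 0)))   # 10^-1 to 10^0

set.seed(9)
xgb_hot_res <-
  xgb_wflow %>%
  tune_grid(resamples = nhl_val, grid = 15, param_info = xgb_params, control = ctrl)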

Updating the workflow

best_auc <- select_best(glm_spline_res, metric = "roc_auc")
best_auc
#> # A tibble: 1 Γ— 3
#>   angle distance .config              
#>   <int>    <int> <chr>                
#> 1     5        8 Preprocessor12_Model1

glm_spline_wflow <-
  glm_spline_wflow %>% 
  finalize_workflow(best_auc)

glm_spline_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 8 Recipe Steps
#> 
#> β€’ step_lencode_mixed()
#> β€’ step_dummy()
#> β€’ step_mutate()
#> β€’ step_rm()
#> β€’ step_zv()
#> β€’ step_ns()
#> β€’ step_ns()
#> β€’ step_normalize()
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

The final fit to the NHL data

test_res <- 
  glm_spline_wflow %>% 
  last_fit(split = nhl_split)

test_res
#> # Resampling results
#> # Manual resampling 
#> # A tibble: 1 Γ— 6
#>   splits              id               .metrics         .notes           .predictions         .workflow 
#>   <list>              <chr>            <list>           <list>           <list>               <list>    
#> 1 <split [9110/3037]> train/test split <tibble [2 Γ— 4]> <tibble [1 Γ— 3]> <tibble [3,037 Γ— 6]> <workflow>
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x1: prediction from a rank-deficient fit may be misleading
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

Remember that last_fit() fits one time with the combined training and validation set, then evaluates one time with the testing set.

Your turn

Finalize your workflow with the best parameters.

Create a final fit.

08:00

Estimates of ROC AUC

Validation results from tuning:

glm_spline_res %>% 
  show_best(metric = "roc_auc", n = 1) %>% 
  select(.metric, mean, n, std_err)
#> # A tibble: 1 Γ— 4
#>   .metric  mean     n std_err
#>   <chr>   <dbl> <int>   <dbl>
#> 1 roc_auc 0.653     1      NA

Test set results:

test_res %>% collect_metrics()
#> # A tibble: 2 Γ— 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.616 Preprocessor1_Model1
#> 2 roc_auc  binary         0.656 Preprocessor1_Model1

Final fitted workflow

Extract the final fitted workflow, fit using the full training set (training plus validation data):

final_glm_spline_wflow <- 
  test_res %>% 
  extract_workflow()

# use this object to predict or deploy
predict(final_glm_spline_wflow, nhl_test[1:3,])
#> # A tibble: 3 Γ— 1
#>   .pred_class
#>   <fct>      
#> 1 no         
#> 2 yes        
#> 3 no

Next steps

  • Use explainers to characterize the model and the predictions.

Explain yourself

There are two categories of model explanations, global and local.

  • Global model explanations provide an overall understanding aggregated over a whole set of observations.

  • Local model explanations provide information about a prediction for a single observation.

You can also build global model explanations by aggregating local model explanations.

tidymodels integrates with model explainability frameworks

A tidymodels explainer

We can build explainers using:

  • original, basic predictors
  • derived features
library(DALEXtra)

glm_explainer <- explain_tidymodels(
  final_glm_spline_wflow,
  data = dplyr::select(nhl_train, -on_goal),
  # DALEX requires an integer for factors:
  y = as.integer(nhl_train$on_goal),
  verbose = FALSE
)
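
The same explainer also supports the local explanations mentioned earlier; here is a sketch for a single observation (predict_parts() defaults to a break-down explanation):

one_shot <- nhl_train[1, ]
shot_breakdown <- predict_parts(glm_explainer, new_observation = one_shot)
plot(shot_breakdown)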

Explain the x coordinates

With our explainer, let’s create partial dependence profiles:

set.seed(123)
pdp_coord_x <- model_profile(
  glm_explainer,
  variables = "coord_x",
  N = 500,
  groups = "position"
)

You can use the default plot() method or create your own visualization.
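
For example, a custom ggplot built from the aggregated profiles (a sketch that assumes DALEX’s internal `_x_`, `_yhat_`, and `_label_` column names):

as_tibble(pdp_coord_x$agr_profiles) %>%
  ggplot(aes(`_x_`, `_yhat_`, color = `_label_`)) +
  geom_line(linewidth = 1.2, alpha = 0.8) +
  labs(x = "coord_x", y = "predicted probability of on goal", color = NULL)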


Your turn

Create an explainer for our glm model.

Try grouping by another variable, like game_type or dow.

05:00
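
One possible grouping (a sketch, assuming game_type is a factor column in nhl_train):

set.seed(123)
pdp_game_type <- model_profile(
  glm_explainer,
  variables = "coord_x",
  N = 500,
  groups = "game_type"
)

plot(pdp_game_type)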