# 6 - Tuning Hyperparameters

Machine learning with tidymodels

## Tuning parameters

Some model or preprocessing parameters cannot be estimated directly from the data.

Some examples:

• Tree depth in decision trees
• Number of neighbors in a K-nearest neighbor model

# Activation function in neural networks?

Sigmoidal functions, ReLu, etc.

Yes, it is a tuning parameter. ✅

# Number of PCA components to retain?

Yes, it is a tuning parameter. ✅

# Bayesian priors for model parameters?

Hmmmm, probably not. These are based on prior belief. ❌

# Covariance/correlation matrix structure in mixed models?

Yes, but it is unlikely to affect performance.

It will impact inference though. 🤔

# Is the random seed a tuning parameter?

Nope. It is not. ❌

## Optimize tuning parameters

• Try different values and measure their performance.
• Find good values for these parameters.
• Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.

## Optimize tuning parameters

The main two strategies for optimization are:

• Grid search 💠 which tests a pre-defined set of candidate values

• Iterative search 🌀 which suggests/estimates new values of candidate parameters to evaluate

## Choosing tuning parameters

Let’s take our previous recipe and add a few changes:

glm_rec <-
recipe(on_goal ~ ., data = nhl_train) %>%
step_lencode_mixed(player, outcome = vars(on_goal)) %>%
step_dummy(all_nominal_predictors()) %>%
step_mutate(
angle = abs(atan2(abs(coord_y), (89 - abs(coord_x))) * (180 / pi)),
distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2),
distance = log(distance),
behind_goal_line = ifelse(abs(coord_x) >= 89, 1, 0)
) %>%
step_rm(coord_x, coord_y) %>%
step_zv(all_predictors()) %>%
step_ns(angle, deg_free = tune("angle")) %>%
step_ns(distance, deg_free = tune("distance")) %>%
step_normalize(all_numeric_predictors())

Let’s tune() our spline terms!

## Choosing tuning parameters

Let’s take our previous recipe and add a few changes:

glm_spline_wflow <-
workflow() %>%
add_recipe(glm_rec)

## Create a grid

glm_spline_wflow %>%
extract_parameter_set_dials()
#> Collection of 2 parameters for tuning
#>
#>  identifier     type    object
#>       angle deg_free nparam[+]
#>    distance deg_free nparam[+]

A parameter set can be updated (e.g. to change the ranges).

## Create a grid

set.seed(2)
grid <-
glm_spline_wflow %>%
extract_parameter_set_dials() %>%
grid_latin_hypercube(size = 25)

grid
#> # A tibble: 23 × 2
#>    angle distance
#>    <int>    <int>
#>  1    12        4
#>  2    15        8
#>  3     6       14
#>  4    10        5
#>  5    12       12
#>  6     7        8
#>  7    14        3
#>  8    14       13
#>  9    11       12
#> 10     8       11
#> # … with 13 more rows
#> # ℹ Use print(n = ...) to see more rows
• A space-filling design like this tends to perform better than random grids.
• Space-filling designs are also usually more efficient than regular grids.

Create a grid for our tunable workflow.

Try creating a regular grid.

03:00

## Create a grid

set.seed(2)
grid <-
glm_spline_wflow %>%
extract_parameter_set_dials() %>%
grid_regular(levels = 25)

grid
#> # A tibble: 225 × 2
#>    angle distance
#>    <int>    <int>
#>  1     1        1
#>  2     2        1
#>  3     3        1
#>  4     4        1
#>  5     5        1
#>  6     6        1
#>  7     7        1
#>  8     8        1
#>  9     9        1
#> 10    10        1
#> # … with 215 more rows
#> # ℹ Use print(n = ...) to see more rows

## Update parameter ranges

set.seed(2)
grid <-
glm_spline_wflow %>%
extract_parameter_set_dials() %>%
update(angle = spline_degree(c(2L, 20L)),
distance = spline_degree(c(2L, 20L))) %>%
grid_latin_hypercube(size = 25)

grid
#> # A tibble: 24 × 2
#>    angle distance
#>    <int>    <int>
#>  1    16        6
#>  2    20       11
#>  3     8       19
#>  4    14        7
#>  5    16       17
#>  6    10       11
#>  7    19        5
#>  8    18       17
#>  9    15       16
#> 10    11       15
#> # … with 14 more rows
#> # ℹ Use print(n = ...) to see more rows

## The results

grid %>%
ggplot(aes(angle, distance)) +
geom_point(size = 4)

# Use the tune_*() functions to tune models

Tune our glm_wflow.

What happens if you don’t supply a grid argument to tune_grid()?

05:00

## Grid results

autoplot(glm_spline_res)

## Tuning results

collect_metrics(glm_spline_res)
#> # A tibble: 48 × 8
#>    angle distance .metric  .estimator  mean     n std_err .config
#>    <int>    <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
#>  1    16        6 accuracy binary     0.610     1      NA Preprocessor01_Model1
#>  2    16        6 roc_auc  binary     0.649     1      NA Preprocessor01_Model1
#>  3    20       11 accuracy binary     0.613     1      NA Preprocessor02_Model1
#>  4    20       11 roc_auc  binary     0.649     1      NA Preprocessor02_Model1
#>  5     8       19 accuracy binary     0.619     1      NA Preprocessor03_Model1
#>  6     8       19 roc_auc  binary     0.652     1      NA Preprocessor03_Model1
#>  7    14        7 accuracy binary     0.610     1      NA Preprocessor04_Model1
#>  8    14        7 roc_auc  binary     0.650     1      NA Preprocessor04_Model1
#>  9    16       17 accuracy binary     0.617     1      NA Preprocessor05_Model1
#> 10    16       17 roc_auc  binary     0.649     1      NA Preprocessor05_Model1
#> # … with 38 more rows
#> # ℹ Use print(n = ...) to see more rows

## Tuning results

collect_metrics(glm_spline_res, summarize = FALSE)
#> # A tibble: 48 × 7
#>    id         angle distance .metric  .estimator .estimate .config
#>    <chr>      <int>    <int> <chr>    <chr>          <dbl> <chr>
#>  1 validation    16        6 accuracy binary         0.610 Preprocessor01_Model1
#>  2 validation    16        6 roc_auc  binary         0.649 Preprocessor01_Model1
#>  3 validation    20       11 accuracy binary         0.613 Preprocessor02_Model1
#>  4 validation    20       11 roc_auc  binary         0.649 Preprocessor02_Model1
#>  5 validation     8       19 accuracy binary         0.619 Preprocessor03_Model1
#>  6 validation     8       19 roc_auc  binary         0.652 Preprocessor03_Model1
#>  7 validation    14        7 accuracy binary         0.610 Preprocessor04_Model1
#>  8 validation    14        7 roc_auc  binary         0.650 Preprocessor04_Model1
#>  9 validation    16       17 accuracy binary         0.617 Preprocessor05_Model1
#> 10 validation    16       17 roc_auc  binary         0.649 Preprocessor05_Model1
#> # … with 38 more rows
#> # ℹ Use print(n = ...) to see more rows

## Choose a parameter combination

show_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 5 × 8
#>   angle distance .metric .estimator  mean     n std_err .config
#>   <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1     5        8 roc_auc binary     0.653     1      NA Preprocessor12_Model1
#> 2     6        9 roc_auc binary     0.653     1      NA Preprocessor17_Model1
#> 3     3       15 roc_auc binary     0.653     1      NA Preprocessor13_Model1
#> 4     3       13 roc_auc binary     0.652     1      NA Preprocessor11_Model1
#> 5     5       19 roc_auc binary     0.652     1      NA Preprocessor24_Model1

## Choose a parameter combination

Create your own tibble for final parameters or use one of the tune::select_*() functions:

select_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 1 × 3
#>   angle distance .config
#>   <int>    <int> <chr>
#> 1     5        8 Preprocessor12_Model1

This best result has:

• low-degree spline for angle (less “wiggly”, less complex)
• higher-degree spline for distance (more “wiggly”, more complex)

Try an alternative selection strategy.

Read the docs for select_by_pct_loss().

Try choosing a model that has a simpler (less “wiggly”) relationship for distance.

05:00

## Choose a parameter combination

select_best(glm_spline_res, metric = "roc_auc")
#> # A tibble: 1 × 3
#>   angle distance .config
#>   <int>    <int> <chr>
#> 1     5        8 Preprocessor12_Model1
select_by_pct_loss(glm_spline_res, distance, metric = "roc_auc")
#> # A tibble: 1 × 10
#>   angle distance .metric .estimator  mean     n std_err .config               .best .loss
#>   <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 <dbl> <dbl>
#> 1    13        4 roc_auc binary     0.646     1      NA Preprocessor20_Model1 0.653 0.984

# Boosted trees 🌳🌲🌴🌵🌴🌳🌳🌴🌲🌵🌴🌲🌳🌴🌳🌵🌵🌴🌲🌲🌳🌴🌳🌴🌲🌴🌵🌴🌲🌴🌵🌲🌵🌴🌲🌳🌴🌵🌳🌴🌳

## Boosted trees 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

• Ensemble many decision tree models

### Review how a decision tree model works:

• Series of splits or if/then statements based on predictors

• First the tree grows until some condition is met (maximum depth, no more data)

• Then the tree is pruned to reduce its complexity

## Boosted trees 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

Boosting methods fit a sequence of tree-based models.

• Each tree is dependent on the one before and tries to compensate for any poor results in the previous trees.

• This is like gradient-based steepest ascent methods from calculus.

## Boosted tree tuning parameters

Most modern boosting methods have a lot of tuning parameters!

• For tree growth and pruning (min_n, max_depth, etc)

• For boosting (trees, stop_iter, learn_rate)

We’ll use early stopping to stop boosting when a few iterations produce consecutively worse results.

## Comparing tree ensembles

Random forest

• Independent trees
• Bootstrapped data
• No pruning
• 1000’s of trees

Boosting

• Dependent trees
• Different case weights
• Tune tree parameters
• Far fewer trees

The general consensus for tree-based models is, in terms of performance: boosting > random forest > bagging > single trees.

## Boosted tree code

xgb_spec <-
boost_tree(
trees = 500, min_n = tune(), stop_iter = tune(), tree_depth = tune(),
learn_rate = tune(), loss_reduction = tune()
) %>%
set_mode("classification") %>%
set_engine("xgboost", validation = 1/10) # <- for better early stopping

xgb_rec <-
recipe(on_goal ~ ., data = nhl_train) %>%
step_lencode_mixed(player, outcome = vars(on_goal)) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())

xgb_wflow <-
workflow() %>%
add_recipe(xgb_rec)

03:00

## Running in parallel

• Grid search, combined with resampling, requires fitting a lot of models!

• These models don’t depend on one another and can be run in parallel.

We can use a parallel backend to do this:

cores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call tune_grid()!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)

## Running in parallel

Speed-ups are fairly linear up to the number of physical cores (10 here).

## Tuning

This will take some time to run ⏳

set.seed(9)

xgb_res <-
xgb_wflow %>%
tune_grid(resamples = nhl_val, grid = 15, control = ctrl) # automatic grid now!

Start tuning the boosted tree model!

We won’t wait for everyone’s tuning to finish, but take this time to get it started before we move on.

03:00

## Tuning results

xgb_res
#> # Tuning results
#> # Validation Set Split (0.8/0.2)
#> # A tibble: 1 × 5
#>   splits              id         .metrics          .notes           .predictions
#>   <list>              <chr>      <list>            <list>           <list>
#> 1 <split [7288/1822]> validation <tibble [30 × 9]> <tibble [0 × 3]> <tibble [27,330 × 11]>

## Tuning results

autoplot(xgb_res)

## Again with the location features

coord_rec <-
xgb_rec %>%
step_mutate(
angle = abs(atan2(abs(coord_y), (89 - abs(coord_x))) * (180 / pi)),
distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2),
distance = log(distance),
behind_goal_line = ifelse(abs(coord_x) >= 89, 1, 0)
) %>%
step_rm(coord_x, coord_y)

xgb_coord_wflow <-
workflow() %>%

set.seed(9)
xgb_coord_res <-
xgb_coord_wflow %>%
tune_grid(resamples = nhl_val, grid = 20, control = ctrl)

## Did the machine figure it out?

show_best(xgb_res, metric = "roc_auc")
#> # A tibble: 5 × 11
#>   min_n tree_depth learn_rate loss_reduction stop_iter .metric .estimator  mean     n std_err .config
#>   <int>      <int>      <dbl>          <dbl>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    19          2    0.311         1.46e- 4        14 roc_auc binary     0.633     1      NA Preprocessor1_Model12
#> 2    11          3    0.0255        5.61e- 7        19 roc_auc binary     0.628     1      NA Preprocessor1_Model05
#> 3    27          5    0.0120        6.04e- 6        12 roc_auc binary     0.627     1      NA Preprocessor1_Model07
#> 4    25         14    0.0379        3.62e- 5         8 roc_auc binary     0.626     1      NA Preprocessor1_Model13
#> 5    31         11    0.00585       1.02e-10         7 roc_auc binary     0.625     1      NA Preprocessor1_Model08

show_best(xgb_coord_res, metric = "roc_auc")
#> # A tibble: 5 × 11
#>   min_n tree_depth learn_rate loss_reduction stop_iter .metric .estimator  mean     n std_err .config
#>   <int>      <int>      <dbl>          <dbl>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    30         13     0.0578  0.00000000141        18 roc_auc binary     0.648     1      NA Preprocessor1_Model07
#> 2    39         12     0.0803  0.000411             10 roc_auc binary     0.643     1      NA Preprocessor1_Model12
#> 3    14          2     0.146   0.00244              19 roc_auc binary     0.642     1      NA Preprocessor1_Model14
#> 4    26         15     0.0365  2.51                 17 roc_auc binary     0.642     1      NA Preprocessor1_Model11
#> 5    35          5     0.101   0.0000000784         13 roc_auc binary     0.641     1      NA Preprocessor1_Model17

## Compare models

Best logistic regression results:

glm_spline_res %>%
show_best(metric = "roc_auc", n = 1) %>%
select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 roc_auc binary     0.653     1      NA Preprocessor12_Model1

Best boosting results:

xgb_coord_res %>%
show_best(metric = "roc_auc", n = 1) %>%
select(.metric, .estimator, mean, n, std_err, .config)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 roc_auc binary     0.648     1      NA Preprocessor1_Model07

Can you get better ROC results with xgboost?

Try increasing learn_rate beyond the original range.

20:00

## Updating the workflow

best_auc <- select_best(glm_spline_res, metric = "roc_auc")
best_auc
#> # A tibble: 1 × 3
#>   angle distance .config
#>   <int>    <int> <chr>
#> 1     5        8 Preprocessor12_Model1

glm_spline_wflow <-
glm_spline_wflow %>%
finalize_workflow(best_auc)

glm_spline_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 8 Recipe Steps
#>
#> • step_lencode_mixed()
#> • step_dummy()
#> • step_mutate()
#> • step_rm()
#> • step_zv()
#> • step_ns()
#> • step_ns()
#> • step_normalize()
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm

## The final fit to the NHL data

test_res <-
glm_spline_wflow %>%
last_fit(split = nhl_split)

test_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#>   splits              id               .metrics         .notes           .predictions         .workflow
#>   <list>              <chr>            <list>           <list>           <list>               <list>
#> 1 <split [9110/3037]> train/test split <tibble [2 × 4]> <tibble [1 × 3]> <tibble [3,037 × 6]> <workflow>
#>
#> There were issues with some computations:
#>
#>   - Warning(s) x1: prediction from a rank-deficient fit may be misleading
#>
#> Run show_notes(.Last.tune.result) for more information.

Remember that last_fit() fits one time with the combined training and validation set, then evaluates one time with the testing set.

Finalize your workflow with the best parameters.

Create a final fit.

08:00

## Estimates of ROC AUC

Validation results from tuning:

glm_spline_res %>%
show_best(metric = "roc_auc", n = 1) %>%
select(.metric, mean, n, std_err)
#> # A tibble: 1 × 4
#>   .metric  mean     n std_err
#>   <chr>   <dbl> <int>   <dbl>
#> 1 roc_auc 0.653     1      NA

Test set results:

test_res %>% collect_metrics()
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config
#>   <chr>    <chr>          <dbl> <chr>
#> 1 accuracy binary         0.616 Preprocessor1_Model1
#> 2 roc_auc  binary         0.656 Preprocessor1_Model1

## Final fitted workflow

Extract the final fitted workflow, fit using the training set:

final_glm_spline_wflow <-
test_res %>%
extract_workflow()

# use this object to predict or deploy
predict(final_glm_spline_wflow, nhl_test[1:3,])
#> # A tibble: 3 × 1
#>   .pred_class
#>   <fct>
#> 1 no
#> 2 yes
#> 3 no

## Next steps

• Use explainers to characterize the model and the predictions.

## Explain yourself

There are two categories of model explanations, global and local.

• Global model explanations provide an overall understanding aggregated over a whole set of observations.

• Local model explanations provide information about a prediction for a single observation.

You can also build global model explanations by aggregating local model explanations.

# tidymodels integrates with model explainability frameworks

## A tidymodels explainer

We can build explainers using:

• original, basic predictors
• derived features
library(DALEXtra)

glm_explainer <- explain_tidymodels(
final_glm_spline_wflow,
data = dplyr::select(nhl_train, -on_goal),
# DALEX required an integer for factors:
y = as.integer(nhl_train\$on_goal),
verbose = FALSE
)

## Explain the x coordinates

With our explainer, let’s create partial dependence profiles:

set.seed(123)
pdp_coord_x <- model_profile(
glm_explainer,
variables = "coord_x",
N = 500,
groups = "position"
)

You can use the default plot() method or create your own visualization.

## Explain the x coordinates

Try grouping by another variable, like game_type or dow.
05:00