low-degree spline for angle (less "wiggly", less complex)
higher-degree spline for coord_x (more "wiggly", more complex)
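These two terms can be expressed as natural spline steps in a recipe. A minimal sketch, assuming the workshop's nhl_train data with outcome on_goal; the deg_free values are illustrative, not tuned choices:

library(recipes)

glm_spline_rec <-
  recipe(on_goal ~ angle + coord_x, data = nhl_train) %>%
  step_ns(angle, deg_free = 4) %>%     # low degrees of freedom: smoother fit
  step_ns(coord_x, deg_free = 15)      # higher degrees of freedom: wigglier fit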
Boosted trees
Ensemble many decision tree models
Review how a decision tree model works:
Series of splits or if/then statements based on predictors
First the tree grows until some stopping condition is met (e.g., maximum depth reached, too few data points to split)
Then the tree is pruned to reduce its complexity
Single decision tree
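In parsnip, these growing and pruning controls map onto a handful of arguments. A minimal spec sketch; the rpart engine and the specific values are assumptions for illustration:

library(parsnip)

tree_spec <-
  decision_tree(
    tree_depth = 10,         # stop growing beyond this depth
    min_n = 20,              # minimum rows needed to attempt a split
    cost_complexity = 0.01   # pruning penalty on tree size
  ) %>%
  set_engine("rpart") %>%
  set_mode("classification")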
Boosted trees
Boosting methods fit a sequence of tree-based models.
Each tree is dependent on the one before and tries to compensate for any poor results in the previous trees.
This is analogous to gradient-based steepest descent methods from calculus: each new tree takes a small step that reduces the ensemble's loss.
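To make the analogy concrete, here is a toy sketch of boosting for squared-error regression; the data and settings are invented for illustration, and real libraries such as xgboost add far more machinery:

library(rpart)

# Hypothetical data for illustration only
set.seed(1)
x_df <- data.frame(x1 = runif(200), x2 = runif(200))
y    <- sin(4 * x_df$x1) + x_df$x2 + rnorm(200, sd = 0.2)

n_trees    <- 50
learn_rate <- 0.1
pred <- rep(mean(y), length(y))   # start from a constant prediction

for (i in seq_len(n_trees)) {
  # For squared error, the residuals are the negative gradient of the loss
  dat <- data.frame(x_df, resid = y - pred)
  fit <- rpart(resid ~ ., data = dat,
               control = rpart.control(maxdepth = 2))
  # Each small tree nudges the ensemble toward the current residuals
  pred <- pred + learn_rate * predict(fit, x_df)
}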
Boosted tree tuning parameters
Most modern boosting methods have a lot of tuning parameters!
For tree growth and pruning (min_n, max_depth, etc)
For boosting (trees, stop_iter, learn_rate)
We'll use early stopping to halt boosting when several consecutive iterations fail to improve the results.
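These parameters show up directly in a parsnip boosted tree spec. A sketch only: the xgboost engine, the validation holdout, and which parameters get tune() are assumptions here:

library(parsnip)

xgb_spec <-
  boost_tree(
    trees = 500,          # an upper limit; early stopping may use fewer
    min_n = tune(),       # tree growth
    tree_depth = tune(),  # tree growth
    learn_rate = tune(),  # boosting
    stop_iter = 10        # stop after 10 iterations with no improvement
  ) %>%
  set_engine("xgboost", validation = 0.1) %>%  # data held out to judge improvement
  set_mode("classification")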
Comparing tree ensembles
Random forest
Independent trees
Bootstrapped data
No pruning
1000s of trees
Boosting
Dependent trees
Different case weights
Tune tree parameters
Far fewer trees
The general consensus for tree-based models is, in terms of performance: boosting > random forest > bagging > single trees.
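For contrast with the boosted spec above, a random forest needs little tuning; a minimal parsnip sketch (the ranger engine is an assumption):

library(parsnip)

rf_spec <-
  rand_forest(trees = 1000) %>%   # many independent, unpruned trees
  set_engine("ranger") %>%
  set_mode("classification")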
Grid search, combined with resampling, requires fitting a lot of models!
These models don't depend on one another and can be run in parallel.
We can use a parallel backend to do this:
cores <- parallelly::availableCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call `tune_grid()`!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)
Running in parallel
Speed-ups are fairly linear up to the number of physical cores (10 here).
Remember that last_fit() fits one time with the combined training and validation set, then evaluates one time with the testing set.
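A minimal sketch of that step, assuming a finalized workflow final_wflow and the initial split object nhl_split from earlier in the workshop:

library(tune)

test_res <- final_wflow %>% last_fit(split = nhl_split)
collect_metrics(test_res)   # metrics are computed once, on the test set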
Your turn
Finalize your workflow with the best parameters.
Create a final fit.
Estimates of ROC AUC
Validation results from tuning:
glm_spline_res %>%
  show_best(metric = "roc_auc", n = 1) %>%
  select(.metric, mean, n, std_err)
#> # A tibble: 1 × 4
#>   .metric  mean     n std_err
#>   <chr>   <dbl> <int>   <dbl>
#> 1 roc_auc 0.879     1      NA
Extract the final fitted workflow (fit using the training set):
final_glm_spline_wflow <-
  test_res %>%
  extract_workflow()

# use this object to predict or deploy
predict(final_glm_spline_wflow, nhl_test[1:3, ])
#> # A tibble: 3 × 1
#>   .pred_class
#>   <fct>
#> 1 no
#> 2 yes
#> 3 yes
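The fitted workflow can also return class probabilities for the same rows:

predict(final_glm_spline_wflow, nhl_test[1:3, ], type = "prob")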