Getting More Out of Feature Engineering and Tuning for Machine Learning
Startup!
library(tidymodels)
library(probably)
library(desirability2)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

# check torch:
if (torch::torch_is_installed()) {
  library(torch)
}

# Load our example data for this section
"https://raw.githubusercontent.com/tidymodels/" |>
  paste0("workshops/main/slides/class_data.RData") |>
  url() |>
  load()
Notable arguments Part 1
hidden_units: the primary way to specify model complexity.
activation: the name of the nonlinear function used to connect the predictors to the hidden layer.
Loss Function:
penalty: amount of regularization used to prevent overfitting.
mixture: the proportion of the total penalty that is L1 (lasso); the remainder is L2 (ridge).
validation: proportion of data to leave out to assess early stopping.
Notable arguments Part 2
Optimization:
optimizer: the type of gradient-based optimization.
epochs: how many passes through the entire data set (i.e., iterations).
stop_iter: number of bad iterations before stopping.
learn_rate: how fast does gradient descent move?
rate_schedule: should the learning rate change over epochs?
batch_size: for stochastic gradient descent.
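Main arguments such as hidden_units, penalty, epochs, activation, and learn_rate go in mlp(); the rest are engine-specific and are passed through set_engine(). Here is a minimal sketch with the "brulee" engine; every value below is illustrative, not a workshop recommendation:

# Illustrative values only: main arguments in mlp(), engine arguments in set_engine()
mlp(hidden_units = 10, penalty = 0.01, activation = "relu",
    epochs = 200, learn_rate = 0.01) |>
  set_engine("brulee",
             mixture = 0.5,            # blend of L1 and L2 penalties
             validation = 0.2,         # proportion held out for early stopping
             optimizer = "SGD",
             rate_schedule = "cyclic",
             batch_size = 64,
             stop_iter = 5) |>
  set_mode("classification")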
That’s a lot 😩
Cost-sensitive learning
One other option is class_weights: the amount to upweight the minority class (the event) when computing the objective function (cross-entropy).
We have a moderate class imbalance, and we’ll use this argument to deal with it.
This will push the minority class probability estimates higher so that more samples are predicted to be events. Overall, the model will be less effective (its probability estimates will be less well calibrated); this trade-off assumes that the minority class is the class of interest.
A single model
nnet_ex_spec <-
  mlp(hidden_units = 20, penalty = 0.01, learn_rate = 0.005, epochs = 100) |>
  set_engine("brulee", class_weights = 3, stop_iter = 10) |>
  set_mode("classification")

rec <-
  recipe(class ~ ., data = sim_train) |>
  step_normalize(all_numeric_predictors())

nnet_ex_wflow <- workflow(rec, nnet_ex_spec)

# Fit on the first fold's 90% analysis set
set.seed(147)
nnet_ex_fit <- fit(nnet_ex_wflow, data = analysis(sim_rs$splits[[1]]))
Did it converge?
nnet_ex_fit |>
  # pull out the brulee fit:
  extract_fit_engine() |>
  autoplot()
The y-axis statistics are computed on the held-out predictions at each iteration.
The vertical green line shows that early stopping occurred.
Tuning parameters
Some model or preprocessing parameters cannot be estimated directly from the data.
Some examples:
Tree depth in decision trees
Number of neighbors in a K-nearest neighbor model
Activation function in neural networks?
Sigmoidal functions, ReLU, etc.
Yes, it is a tuning parameter. ✅
Number of PCA columns to generate for feature extraction?
Yes, it is a preprocessing tuning parameter. ✅
The validation set size?
Nope! ❌
Bayesian priors for model parameters?
Hmmmm, probably not. These are based on prior belief. ❌
The class probability cutoff?
This is a value \(C\) used to threshold \(Pr[Class = 1] \ge C\).
For two classes, the default is \(C = 1/2\).
Yes, it is a postprocessing tuning parameter (see the sketch after this list). ✅
The random seed?
Nope. It is not. ❌
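To make the cutoff \(C\) concrete, here is a minimal sketch (not part of the workshop code) using probably::make_two_class_pred(), which thresholds a vector of event probabilities. The tiny preds tibble, its .pred_event column, and the factor levels are made up for illustration:

# A tiny, made-up set of event probabilities to illustrate the cutoff
preds <- tibble(.pred_event = c(0.10, 0.30, 0.55, 0.80))

preds |>
  mutate(
    .pred_class = make_two_class_pred(
      estimate  = .pred_event,
      levels    = c("event", "no_event"),
      threshold = 0.25   # C = 1/4 instead of the default C = 1/2
    )
  )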
Optimize tuning parameters
Try different values and measure their performance.
Find good values for these parameters.
Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting it to the entire training set.
Tagging parameters for tuning
With tidymodels, you can mark the parameters that you want to optimize with a value of tune().
The function itself just returns… itself:
tune()
#> tune()

str(tune())
#> language tune()

# optionally add a label
tune("I hope that the workshop is going well")
#> tune("I hope that the workshop is going well")
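For example, we could tag several of the neural network's arguments (including the engine's class_weights) for optimization and evaluate them with resampling. This is a sketch rather than the workshop's actual tuning code; the grid size and metric are illustrative choices:

# Tag parameters for tuning (illustrative choices, not the workshop's code)
nnet_tune_spec <-
  mlp(hidden_units = tune(), penalty = tune(), learn_rate = tune(), epochs = 100) |>
  set_engine("brulee", class_weights = tune(), stop_iter = 10) |>
  set_mode("classification")

nnet_tune_wflow <- workflow(rec, nnet_tune_spec)

# Evaluate a small space-filling grid using the existing resamples
set.seed(12)
nnet_tune_res <-
  nnet_tune_wflow |>
  tune_grid(resamples = sim_rs, grid = 10, metrics = metric_set(brier_class))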
The problem here is that we are biasing the probability estimates so that more samples are predicted to be the rare “event” class under the default probability cutoff of 1/2.
That is compromising the overall model fit; our probabilities are not accurate.
If we don’t use the class probability estimates, this is fine.
In the postprocessing slides, we’ll examine an alternative approach that involves different kinds of tradeoffs.
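One way to check how well calibrated the probabilities are (not shown in the workshop code) is a calibration plot from the probably package, applied to held-out predictions. The object nnet_preds and its columns class and .pred_event below are assumed names:

# Assumed: `nnet_preds` holds out-of-sample predictions with the true factor
# column `class` and the event probability column `.pred_event`
nnet_preds |>
  cal_plot_breaks(truth = class, estimate = .pred_event)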
Fitting a workflow
Let’s say that we want to train the model using the “best” parameter values.
We can use a tibble of tuning parameters and splice them into the workflow in place of tune():
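One way to do this is finalize_workflow() from the tune package. A minimal sketch, assuming the tagged workflow from the earlier tuning sketch and purely illustrative “best” values (not results computed in the workshop):

# Illustrative parameter values; in practice these would come from the tuning results
best_param <- tibble(hidden_units = 20, penalty = 0.01,
                     learn_rate = 0.005, class_weights = 3)

final_wflow <- finalize_workflow(nnet_tune_wflow, best_param)

set.seed(101)
final_nnet_fit <- fit(final_wflow, data = sim_train)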