The recipes package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.
Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
The resulting processed output can be used as inputs for statistical or machine learning models.
A first recipe
nhl_rec <- recipe(on_goal ~ ., data = nhl_train)
The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.
nhl_rec <- recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors())
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
Filter out constant columns
nhl_rec <- recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
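To see the zero-variance filter in action, here is a minimal sketch on made-up data. The data frame toy_train and its columns are hypothetical and are not part of the NHL data; the factor position deliberately declares a level ("G") that never occurs.

```r
library(recipes)

# Hypothetical toy data: the level "G" is declared but never observed.
toy_train <- data.frame(
  on_goal  = factor(c("yes", "no", "yes", "no", "yes", "no")),
  position = factor(c("C", "D", "C", "L", "D", "L"),
                    levels = c("C", "D", "G", "L")),
  distance = c(10, 35, 22, 48, 15, 60)
)

toy_rec <- recipe(on_goal ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors()) %>%  # creates an all-zero indicator for "G"
  step_zv(all_predictors())                 # drops any single-valued column

# prep() estimates the steps; bake(new_data = NULL) returns the processed training set.
toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```

The indicator for the unobserved level should be absent from the result, while the indicators for observed levels remain.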
Normalization
nhl_rec <- recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
This centers and scales the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
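The "no re-estimation" point can be checked with a small sketch. The data frames toy_train and toy_new are invented for illustration; the comparison recomputes the normalization by hand from the training statistics.

```r
library(recipes)

# Hypothetical data: the new data are on a very different scale.
toy_train <- data.frame(y = c(1.2, 0.7, 3.1, 2.4), x = c(10, 20, 30, 40))
toy_new   <- data.frame(y = c(0.5, 1.5), x = c(100, 200))

norm_rec <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()  # mean and SD of x are estimated from toy_train only

baked_new <- bake(norm_rec, new_data = toy_new)

# The same transformation by hand, using the training mean and SD.
by_hand <- (toy_new$x - mean(toy_train$x)) / sd(toy_train$x)
all.equal(baked_new$x, by_hand)
```

If the recipe re-estimated the statistics on toy_new, the two results would differ.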
Compute and plot an ROC curve for your current model.
What data are being used for this ROC curve plot?
What do we do with the player data? 🏒
There are 598 unique player values in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables 😳
lump players who rarely shoot into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the shooter column with the estimated effect of that predictor
Let’s look at othering then effect encodings.
Per-player statistics
Collapsing factor levels
There is a recipe step that will redefine factor levels based on their frequency in the training set:
nhl_other_rec <- recipe(on_goal ~ ., data = nhl_train) %>%
  # Any player with less than 0.1% of shots is set to "other"
  step_other(shooter, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
Using this code, 402 players (out of 598) were collapsed into “other” based on the training set.
We could try to optimize the threshold for collapsing (see the next set of slides on model tuning).
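A minimal sketch of how the threshold behaves, using invented shooters and a much larger threshold (5%) so the effect is visible on 100 rows. The data frame toy_shots is hypothetical.

```r
library(recipes)

# Hypothetical data: shooters A, B, C are frequent; D and E each have 1% of shots.
toy_shots <- data.frame(
  on_goal = factor(rep(c("yes", "no"), 50)),
  shooter = factor(c(rep("A", 60), rep("B", 30), rep("C", 8), "D", "E"))
)

other_rec <- recipe(on_goal ~ shooter, data = toy_shots) %>%
  step_other(shooter, threshold = 0.05) %>%  # pool levels below 5% of rows
  prep()

pooled_levels <- levels(bake(other_rec, new_data = NULL)$shooter)
pooled_levels
```

Here D and E should be pooled into "other", while C (8% of rows) keeps its own level.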
Does othering help?
nhl_other_wflow <- nhl_glm_wflow %>%
  update_recipe(nhl_other_rec)

nhl_other_res <- nhl_other_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_other_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
#> 1 accuracy binary     0.778     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.804     1      NA Preprocessor1_Model1
A little better ROC AUC and much faster to complete.
Now let’s look at a more sophisticated tool called effect encodings.
What is an effect encoding?
We replace the qualitative predictor’s data with an estimate of its effect on the outcome.
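To make the idea concrete, here is a hand-rolled sketch in base R, assuming a binary outcome and no pooling across levels. It is an illustration of the concept, not the recipe step used in these slides (the embed package provides steps such as step_lencode_glm() and step_lencode_mixed() for this). The data frame toy_shots is hypothetical.

```r
# Hypothetical data: 1 = shot on goal, 0 = not on goal.
toy_shots <- data.frame(
  on_goal = c(1, 1, 1, 0,  1, 0, 0, 0,  1, 0),
  shooter = factor(c(rep("A", 4), rep("B", 4), rep("C", 2)))
)

# A logistic regression without an intercept gives one coefficient per shooter:
# that shooter's log-odds of putting a shot on goal.
effect_fit <- glm(on_goal ~ 0 + shooter, data = toy_shots, family = binomial)
effects <- setNames(coef(effect_fit), levels(toy_shots$shooter))

# Replace the factor with a single numeric column holding the estimated effect.
toy_shots$shooter_effect <- unname(effects[as.character(toy_shots$shooter)])
toy_shots
```

However many players there are, the encoding produces one numeric column. Estimates for players with very few shots are noisy, which is one motivation for methods that pool information across levels.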
Compare recipes
library(forcats)

collect_metrics(nhl_glm_set_res) %>%
  filter(.metric == "roc_auc") %>%
  mutate(
    features = gsub("_logistic", "", wflow_id),
    features = fct_reorder(features, mean)
  ) %>%
  ggplot(aes(x = mean, y = features)) +
  geom_point(size = 3) +
  labs(y = NULL, x = "ROC AUC (validation set)")
Debugging a recipe
Typically, you will want to use a workflow to estimate and apply a recipe.
If you have an error and need to debug your recipe, the original recipe object (e.g. encoded_players) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
Another function (bake()) is analogous to predict(), and gives you the processed data back.
The tidy() function can be used to get specific results from the recipe.
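A minimal sketch of the three debugging functions together, on an invented data set (toy_train and its columns are hypothetical, and the recipe is a simple stand-in rather than encoded_players):

```r
library(recipes)

# Hypothetical data with two numeric predictors.
toy_train <- data.frame(
  y  = c(1.2, 0.7, 3.1, 2.4, 1.9),
  x1 = c(10, 20, 30, 40, 50),
  x2 = c(0.5, 0.1, 0.9, 0.3, 0.7)
)

debug_rec <- recipe(y ~ ., data = toy_train) %>%
  step_normalize(all_numeric_predictors())

prepped <- prep(debug_rec)                    # estimate the steps, like fit()
baked   <- bake(prepped, new_data = toy_train) # apply them, like predict()

tidy(prepped)                                 # overview: one row per step
norm_stats <- tidy(prepped, number = 1)       # estimated values for step 1
norm_stats
```

Inspecting the output of tidy() for a single step is often the quickest way to confirm that a step selected the columns you intended.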