The recipes package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.
Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
The resulting processed output can be used as inputs for statistical or machine learning models.
A first recipe
nhl_rec <- recipe(on_goal ~ ., data = nhl_train)
The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.
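To see which role each column received, one option is to call summary() on the recipe. The sketch below assumes the built-in mtcars data as a stand-in for nhl_train, with mpg playing the part of on_goal; neither comes from the original example.

```r
library(recipes)

# mtcars is a stand-in for nhl_train (an assumption for illustration);
# mpg takes the place of the outcome on_goal.
rec <- recipe(mpg ~ ., data = mtcars)

# summary() returns one row per column, including its assigned role.
roles <- summary(rec)
roles[, c("variable", "role")]
```

The left-hand side of the formula becomes the outcome, and every column matched by the dot becomes a predictor.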
nhl_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors())
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
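As a rough illustration of what step_dummy() produces, the sketch below uses the built-in iris data (an assumption, not the hockey data) and the prep() and bake() functions covered in the debugging section to look at the processed columns.

```r
library(recipes)

# iris is a stand-in data set; Species is its only factor predictor.
dummy_rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe on iris, then return the processed training data.
dummy_data <- dummy_rec %>%
  prep() %>%
  bake(new_data = NULL)

# Species is replaced by indicator columns for its non-reference levels.
names(dummy_data)
```

With the default settings, the first factor level acts as the reference, so a three-level factor yields two indicator columns.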
Filter out constant columns
nhl_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
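The situation described above can be reproduced with a small made-up data frame. In this sketch, toy, y, and x are hypothetical names, and the factor x declares a level "c" that never occurs in the rows.

```r
library(recipes)

# Hypothetical toy data: level "c" is declared but never observed.
toy <- data.frame(
  y = c(1.2, 3.4, 2.2, 4.1),
  x = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

zv_rec <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%  # indicators for the non-reference levels
  step_zv(all_predictors())                 # drop predictors with a single unique value

zv_data <- zv_rec %>%
  prep() %>%
  bake(new_data = NULL)

# The indicator for the unobserved level "c" is no longer present.
names(zv_data)
```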
Normalization
nhl_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
This centers and scales the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
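A small check of the "no re-estimation" behavior, using made-up data frames (train and new are hypothetical names): the new data should be centered and scaled with the training mean and standard deviation, not with its own.

```r
library(recipes)

# Hypothetical training data and new data.
train <- data.frame(y = c(1, 0, 1, 0, 1), x = c(1, 2, 3, 4, 5))
new   <- data.frame(y = c(0, 1), x = c(10, 20))

# prep() estimates the mean and standard deviation of x from train only.
norm_rec <- recipe(y ~ x, data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# bake() applies those stored statistics to the new data.
baked_new <- bake(norm_rec, new_data = new)

# What we expect if the training statistics are reused.
expected <- (new$x - mean(train$x)) / sd(train$x)
all.equal(baked_new$x, expected)
```

If the statistics were re-estimated on the new data, the two normalized values would instead be symmetric around zero.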
Compare recipes
library(forcats)

# Plot the validation-set ROC AUC for each feature set
collect_metrics(nhl_glm_set_res) %>%
  filter(.metric == "roc_auc") %>%
  mutate(
    # Strip the model suffix so only the recipe name remains
    features = gsub("_logistic", "", wflow_id),
    features = fct_reorder(features, mean)
  ) %>%
  ggplot(aes(x = mean, y = features)) +
  geom_point(size = 3) +
  labs(y = NULL, x = "ROC AUC (validation set)")
Debugging a recipe
Typically, you will want to use a workflow to estimate and apply a recipe.
If you have an error and need to debug your recipe, the original recipe object (e.g. encoded_players) can be estimated manually with a function called prep(). It is analogous to fit().
Another function (bake()) is analogous to predict(), and gives you the processed data back.
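A minimal debugging sketch along these lines, assuming mtcars as a stand-in for the real training data and a single normalization step rather than the encoded_players recipe. The tidy() call is an optional extra for inspecting what a step estimated.

```r
library(recipes)

# Stand-in recipe (an assumption; not the encoded_players recipe).
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the recipe, much like fit() does for a model.
prepped <- prep(rec, training = mtcars)

# Inspect the statistics estimated by the first step.
step_stats <- tidy(prepped, number = 1)

# bake() returns processed data, much like predict() returns predictions.
train_processed <- bake(prepped, new_data = NULL)         # processed training set
new_processed   <- bake(prepped, new_data = head(mtcars)) # any other data set
```

Running the steps one at a time this way makes it easier to see which step raises the error and what the data look like just before it.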
More on recipes
Once fit() is called on a workflow, changing the model does not re-fit the recipe.