Extras - Recipes
Introduction to tidymodels
Looking at the predictors
forested_train
#> # A tibble: 5,685 × 19
#> forested year elevation eastness northness roughness tree_no_tree dew_temp precip_annual temp_annual_mean temp_annual_min temp_annual_max temp_january_min vapor_min vapor_max canopy_cover lon
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 No 2016 464 -5 -99 7 No tree 1.71 282 9.76 -4.44 16.6 2.96 191 1534 4 -121.
#> 2 Yes 2016 166 92 37 7 Tree 6 1298 10.2 0.72 14.3 6.12 60 747 33 -122.
#> 3 No 2016 644 -85 -52 24 No tree 0.67 288 8.77 -6.32 14.6 2.98 219 1396 0 -120.
#> 4 Yes 2014 1285 4 99 79 Tree 1.91 1621 5.61 -2.48 9.48 1.73 88 545 74 -123.
#> 5 Yes 2013 822 87 48 68 Tree 1.95 2200 8.62 -0.68 12.9 4.35 147 861 48 -121.
#> 6 Yes 2017 3 6 -99 5 Tree 7.93 2211 10.6 3.77 14.2 7.02 34 578 79 -124.
#> 7 Yes 2014 2041 -95 28 49 Tree -4.22 1551 0.75 -9.47 5.17 -3.66 73 481 48 -120.
#> 8 Yes 2015 1009 -8 99 72 Tree 1.72 2396 6.59 -2.98 11.3 1.88 92 781 76 -122.
#> 9 No 2017 436 -98 19 10 No tree 1.8 234 9.8 -4.23 16.3 3.32 178 1527 0 -119.
#> 10 No 2018 775 63 76 103 No tree 0.62 432 8.51 -5.5 13.7 3.32 241 1237 7 -120.
#> # ℹ 5,675 more rows
#> # ℹ 2 more variables: lat <dbl>, land_type <fct>
Working with other models
Some models can’t handle non-numeric data
Linear Regression
K Nearest Neighbors
Some models struggle if numeric predictors aren’t scaled
K Nearest Neighbors
Anything using gradient descent
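Why scaling matters for K Nearest Neighbors can be sketched with base R alone. The five points below are made up for illustration (they are not taken from forested_train): elevation is measured in the hundreds or thousands, temperature in single or double digits.

```r
# Hypothetical points; the names a-e are labels for illustration only.
pts <- data.frame(
  elevation = c(100, 110, 300, 2000, 1500),
  temp      = c(2, 25, 3, 10, 20),
  row.names = c("a", "b", "c", "d", "e")
)

# Euclidean distances on the raw and on the centered-and-scaled columns
raw_d    <- as.matrix(dist(pts))
scaled_d <- as.matrix(dist(scale(pts)))

# Nearest neighbor of point "a" (dropping "a" itself, the first column)
raw_nn    <- names(which.min(raw_d["a", -1]))
scaled_nn <- names(which.min(scaled_d["a", -1]))
c(raw = raw_nn, scaled = scaled_nn)
```

On the raw scale, elevation dominates the distance, so "b" looks closest to "a" despite a 23-degree temperature gap. After scaling, both columns contribute comparably and "c" becomes the nearest neighbor.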
Types of needed preprocessing
Do qualitative predictors require a numeric encoding?
Should columns with a single unique value be removed?
Does the model struggle with missing data?
Does the model struggle with correlated predictors?
Should predictors be centered and scaled?
Is it helpful to transform predictors to be more symmetric?
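Each question above maps to one or more recipe steps. The sketch below uses a small made-up tibble (the column names and values are assumptions for illustration) and one possible choice of step per question; other steps, such as different imputation methods, could fill the same roles.

```r
library(recipes)

# Hypothetical data, built so that every question above applies
toy <- tibble::tibble(
  y     = c(1.2, 3.4, 2.2, 5.1, 4.8, 3.9),
  group = factor(c("a", "b", "a", "c", "b", "c")),  # qualitative predictor
  const = rep(1, 6),                                # single unique value
  x1    = c(1, 2, NA, 4, 5, 6),                     # has a missing value
  x2    = c(2.1, 3.9, 6.2, 8.0, 9.9, 12.1),         # nearly collinear with x1
  skew  = c(200, 1, 5, 2, 40, 3)                    # right-skewed
)

toy_rec <- recipe(y ~ ., data = toy) %>%
  step_impute_mean(all_numeric_predictors()) %>%             # missing data
  step_zv(all_predictors()) %>%                              # single-valued columns
  step_YeoJohnson(skew) %>%                                  # more symmetric
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%   # correlated predictors
  step_dummy(all_nominal_predictors()) %>%                   # numeric encoding
  step_normalize(all_numeric_predictors())                   # center and scale

baked <- bake(prep(toy_rec), new_data = NULL)
names(baked)
```

Not every model needs every step; the list of questions is a checklist to run through per model, not a fixed pipeline.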
Two types of preprocessing
General definitions
Data preprocessing is what you do to make your model successful.
Feature engineering is what you do to the original predictors so that the model has to do the least work to perform well.
Working with dates
Datetime variables are automatically converted to integers if given as raw predictors. To avoid this, they can be re-encoded as:
Days since a reference date
Day of the week
Month
Year
Leap year
Indicators for holidays
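The encodings listed above can be sketched with recipes steps. The `visits` tibble, its columns, and the reference date are assumptions made up for this example; `step_date()` and `step_holiday()` are recipes functions, and the leap-year indicator relies on `lubridate::leap_year()`.

```r
library(recipes)

# Hypothetical data: a date column and a numeric outcome
visits <- tibble::tibble(
  date  = as.Date(c("2024-01-01", "2024-02-29", "2024-07-04", "2023-12-25")),
  count = c(10, 12, 30, 8)
)

date_rec <- recipe(count ~ date, data = visits) %>%
  step_mutate(
    # days since an arbitrary reference date
    days_since = as.numeric(date - as.Date("2023-01-01")),
    # leap-year indicator
    leap_year  = as.integer(lubridate::leap_year(date))
  ) %>%
  # day of the week, month, and year
  step_date(date, features = c("dow", "month", "year")) %>%
  # 0/1 indicators for selected holidays
  step_holiday(date, holidays = c("NewYearsDay", "ChristmasDay")) %>%
  # drop the raw date once the features exist
  step_rm(date)

baked <- bake(prep(date_rec), new_data = NULL)
baked
```

The new columns are named after the original variable, e.g. `date_dow`, `date_month`, `date_year`, and `date_NewYearsDay`.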
The recipes package
Modular + extensible
Works well with pipes, |> and %>%
Deferred evaluation
Isolates test data from training data
Can do things formulas can’t
How to write a recipe
forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())
Start by calling recipe() to denote the data source and variables used.
Specify what actions to take by adding step_*() functions.
Use {tidyselect} and recipes-specific selectors to denote affected variables.
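The three ways of pointing a step at columns can be sketched side by side. The toy tibble below is made up for illustration; it only borrows a few column names from forested_train.

```r
library(recipes)

# Hypothetical data with column names borrowed from forested_train
toy <- tibble::tibble(
  forested        = factor(c("Yes", "No", "Yes", "No")),
  tree_no_tree    = factor(c("Tree", "No tree", "Tree", "No tree")),
  temp_annual_min = c(-4.4, 0.7, -6.3, -2.5),
  temp_annual_max = c(16.6, 14.3, 14.6, 9.5),
  canopy_cover    = c(4, 33, 0, 74)
)

sel_rec <- recipe(forested ~ ., data = toy) %>%
  step_log(canopy_cover, offset = 0.5) %>%     # a bare column name
  step_normalize(starts_with("temp_")) %>%     # a tidyselect helper
  step_dummy(all_nominal_predictors())         # a recipes selector (role + type)

baked <- bake(prep(sel_rec), new_data = NULL)
names(baked)
```

Selectors such as all_nominal_predictors() use the roles set by recipe(), so the outcome forested is left untouched even though it is a factor.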
Using a recipe
forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())
Save the recipe we like so that we can use it in various places, e.g., with different models.
Using a recipe with workflows
Recipes are typically combined with a model in a workflow() object:
# forested is a factor outcome, so a classification model is needed
forested_wflow <- workflow() %>%
  add_recipe(forested_rec) %>%
  add_model(logistic_reg())
Recipes are estimated
Every preprocessing step in a recipe that involves calculations estimates them from the training set. For example:
Levels of a factor
Determination of zero-variance
Normalization
Feature extraction
Once a recipe is added to a workflow, this estimation occurs when fit() is called.
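A minimal sketch of what "estimated" means, using made-up numbers rather than the forested data: the mean and standard deviation come from the training set, and new data are processed with those stored values.

```r
library(recipes)

# Hypothetical training set: mean(x) = 2, sd(x) = 1
train <- tibble::tibble(y = c(1, 2, 3), x = c(1, 2, 3))
# Hypothetical new observation
new <- tibble::tibble(x = 10)

norm_rec <- recipe(y ~ x, data = train) %>%
  step_normalize(x)

trained <- prep(norm_rec)               # statistics come from `train` only
new_x   <- bake(trained, new_data = new)
new_x$x                                 # (10 - 2) / 1
```

The new value is not re-centered on itself. It is expressed relative to the training distribution, which is what keeps information in new data from leaking into the preprocessing.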
Debugging a recipe
Typically, you will want to use a workflow to estimate and apply a recipe.
If you have an error and need to debug your recipe, the original recipe object (e.g., forested_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
Another function, bake(), is analogous to predict() and gives you the processed data back.
Your turn
Take the recipe, prep() it, then bake() it to see what the resulting data set looks like.
Try removing steps to see how the result changes.
Printing a recipe
forested_rec
#>
#> ── Recipe ────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 18
#>
#> ── Operations
#> • Dummy variables from: all_nominal_predictors()
#> • Zero variance filter on: all_predictors()
#> • Log transformation on: canopy_cover
#> • Centering and scaling for: all_numeric_predictors()
Prepping a recipe
prep(forested_rec)
#>
#> ── Recipe ────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 18
#>
#> ── Training information
#> Training data contained 5685 data points and no incomplete rows.
#>
#> ── Operations
#> • Dummy variables from: tree_no_tree and land_type | Trained
#> • Zero variance filter removed: <none> | Trained
#> • Log transformation on: canopy_cover | Trained
#> • Centering and scaling for: year and elevation, ... | Trained
Baking a recipe
prep(forested_rec) %>%
  bake(new_data = forested_train)
#> # A tibble: 5,685 × 20
#> year elevation eastness northness roughness dew_temp precip_annual temp_annual_mean temp_annual_min temp_annual_max temp_january_min vapor_min vapor_max canopy_cover lon lat forested
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 0.206 -0.450 -0.0203 -1.38 -0.874 -0.169 -0.864 0.532 -0.403 0.959 -0.0702 0.755 1.21 -0.566 -0.175 -0.988 No
#> 2 0.206 -1.07 1.38 0.563 -0.874 1.33 0.132 0.726 1.24 0.159 1.34 -1.00 -1.00 0.494 -0.856 1.10 Yes
#> 3 0.206 -0.0762 -1.17 -0.711 -0.506 -0.533 -0.858 0.116 -1.00 0.248 -0.0613 1.13 0.820 -1.73 0.270 0.175 No
#> 4 -0.413 1.26 0.109 1.45 0.683 -0.0987 0.448 -1.21 0.221 -1.56 -0.621 -0.625 -1.57 0.917 -1.52 0.654 Yes
#> 5 -0.723 0.294 1.31 0.721 0.445 -0.0847 1.02 0.0529 0.795 -0.346 0.552 0.166 -0.683 0.690 -0.419 1.32 Yes
#> 6 0.516 -1.41 0.138 -1.38 -0.917 2.01 1.03 0.894 2.21 0.127 1.75 -1.35 -1.48 0.951 -1.84 -0.338 Yes
#> 7 -0.413 2.83 -1.32 0.435 0.0343 -2.25 0.380 -3.26 -2.01 -3.09 -3.03 -0.826 -1.75 0.690 -0.0159 1.65 Yes
#> 8 -0.103 0.682 -0.0636 1.45 0.532 -0.165 1.21 -0.801 0.0620 -0.915 -0.554 -0.571 -0.908 0.931 -0.636 -1.33 Yes
#> 9 0.516 -0.508 -1.36 0.306 -0.809 -0.137 -0.911 0.549 -0.336 0.856 0.0909 0.581 1.19 -1.73 0.760 -0.209 No
#> 10 0.826 0.196 0.960 1.12 1.20 -0.550 -0.717 0.00667 -0.741 -0.0616 0.0909 1.43 0.373 -0.296 0.155 -0.204 No
#> # ℹ 5,675 more rows
#> # ℹ 3 more variables: tree_no_tree_No.tree <dbl>, land_type_Non.tree.vegetation <dbl>, land_type_Tree <dbl>
Tidying a recipe
Once a recipe has been estimated, there are various bits of information saved in it.
The tidy() function can be used to get specific results from the recipe.
Your turn
Take a prepped recipe and use the tidy() function on it.
Use the number argument to inspect different steps.
Tidying a recipe
prep(forested_rec) %>%
  tidy()
#> # A tibble: 4 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step dummy TRUE FALSE dummy_jlmcG
#> 2 2 step zv TRUE FALSE zv_mYCvS
#> 3 3 step log TRUE FALSE log_eme6b
#> 4 4 step normalize TRUE FALSE normalize_ScVef
Tidying a recipe
prep(forested_rec) %>%
  tidy(number = 1)
#> # A tibble: 3 × 3
#> terms columns id
#> <chr> <chr> <chr>
#> 1 tree_no_tree No tree dummy_jlmcG
#> 2 land_type Non-tree vegetation dummy_jlmcG
#> 3 land_type Tree dummy_jlmcG
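What tidy() returns depends on the step. As a small sketch on made-up data (not the forested recipe), tidying a normalization step returns the means and standard deviations it estimated from the training set.

```r
library(recipes)

# Hypothetical training set: mean(x) = 2, sd(x) = 1
train <- tibble::tibble(y = c(1, 2, 3), x = c(1, 2, 3))

prepped <- recipe(y ~ x, data = train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Step 2 is the normalization; each row holds one statistic for one column
norm_stats <- tidy(prepped, number = 2)
norm_stats
```

The result has one row per column and statistic, with the estimate in the `value` column.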
Using a recipe in tidymodels
The recommended way to use a recipe in tidymodels is to use it as part of a workflow().
# forested is a factor outcome, so a classification model is needed
forested_wflow <- workflow() %>%
  add_recipe(forested_rec) %>%
  add_model(logistic_reg())
When used in this way, you don't need to worry about prep() and bake(), as both are handled for you.
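A self-contained sketch of that division of labor, using the built-in mtcars data as a stand-in (it is not the forested data, and the train/new split by row position is arbitrary). Because the outcome is a factor, the example pairs the recipe with logistic_reg().

```r
library(tidymodels)

# Stand-in data: predict transmission type (am) in mtcars
cars <- mtcars %>%
  dplyr::mutate(am = factor(am, labels = c("automatic", "manual")))
cars_train <- cars[1:24, ]
cars_new   <- cars[25:32, ]

cars_rec <- recipe(am ~ mpg + hp, data = cars_train) %>%
  step_normalize(all_numeric_predictors())

cars_wflow <- workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(logistic_reg())

# fit() estimates the recipe on cars_train, then fits the model
cars_fit <- fit(cars_wflow, data = cars_train)

# predict() applies the trained recipe to cars_new before predicting
preds <- predict(cars_fit, new_data = cars_new)
preds
```

Neither prep() nor bake() appears in the code; the workflow calls the equivalent operations internally at fit and predict time.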