Extras - Recipes

Introduction to tidymodels

Looking at the predictors

forested_train
#> # A tibble: 8,749 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp precip_annual temp_annual_mean temp_annual_min temp_annual_max temp_january_min vapor_min vapor_max canopy_cover   lon   lat
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>         <dbl>            <dbl>           <dbl>           <dbl>            <dbl>     <dbl>     <dbl>        <dbl> <dbl> <dbl>
#>  1 Yes       1997        66       82        10 Tree            12.2           1315             18.4            1.88            24.9            11.8        121      1888           66 -84.9  32.4
#>  2 No        1997       284      -99        58 Tree            10.3           1236             16.1           -0.26            22.3             9.92        68      1586           80 -85.0  34.1
#>  3 Yes       2022       130       86        15 Tree            11.8           1194             17.6            1.3             24.1            11.2         61      1753           96 -83.0  33.1
#>  4 Yes       2021       202      -55         3 Tree            10.7           1235             16.6            0.05            23.0            10.2         72      1682           65 -83.0  33.9
#>  5 Yes       1995        75      -89         1 Tree            13.8           1256             19.2            3.63            25.5            12.9         57      1796           88 -83.4  31.3
#>  6 No        1995       110      -53         5 Tree            12.4           1236             18.6            2.53            24.8            12.4        102      1835           51 -83.9  32.2
#>  7 Yes       2022       111       73        12 Tree            11.5           1168             17.4            1               24.0            10.9         67      1772           84 -82.2  33.6
#>  8 Yes       1997       230       96        14 Tree             9.98          1373             15.4           -1.35            21.8             9.03        46      1552           68 -85.3  34.8
#>  9 Yes       2002       160      -88        13 Tree            11.1           1219             16.9            0.07            23.6            10.2         53      1731           95 -83.2  33.6
#> 10 Yes       2020        39        9         6 Tree            13.9           1237             19.2            3.25            25.6            12.8         58      1812           86 -82.0  31.7
#> # ℹ 8,739 more rows
#> # ℹ 2 more variables: land_type <fct>, county <fct>

Working with other models

Some models can’t handle non-numeric data

  • Linear Regression
  • K Nearest Neighbors


Some models struggle if numeric predictors aren’t scaled

  • K Nearest Neighbors
  • Anything using gradient descent

Types of needed preprocessing

  • Do qualitative predictors require a numeric encoding?

  • Should columns with a single unique value be removed?

  • Does the model struggle with missing data?

  • Does the model struggle with correlated predictors?

  • Should predictors be centered and scaled?

  • Is it helpful to transform predictors to be more symmetric?
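
As a rough, illustrative map from these questions to steps in the recipes package (a sketch only; which steps you actually need depends on the model, and the selectors shown are just one reasonable choice):

library(tidymodels)

rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%          # numeric encoding for qualitative predictors
  step_zv(all_predictors()) %>%                     # drop columns with a single unique value
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_corr(all_numeric_predictors()) %>%           # filter out highly correlated predictors
  step_YeoJohnson(all_numeric_predictors()) %>%     # make predictors more symmetric
  step_normalize(all_numeric_predictors())          # center and scale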

Two types of preprocessing

General definitions

  • Data preprocessing is what you do to the data so that the model can be fit successfully.
  • Feature engineering is what you do to the original predictors so that the model has to do the least work to predict the outcome well.

Working with dates

Datetime variables are automatically converted to an integer if given as a raw predictor. To avoid this, they can be re-encoded as (see the sketch after this list):

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Leap year
  • Indicators for holidays
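
As a sketch: the forested data here has no raw date column, so sample_date and forested_with_dates below are hypothetical stand-ins for a datetime predictor and a data frame that contains it.

date_rec <- recipe(forested ~ sample_date, data = forested_with_dates) %>%
  # extract day of week, month, and year as new predictors
  step_date(sample_date, features = c("dow", "month", "year")) %>%
  # add indicator columns for selected holidays
  step_holiday(sample_date, holidays = c("NewYearsDay", "USThanksgivingDay", "ChristmasDay")) %>%
  # drop the raw datetime so it is not silently converted to an integer
  step_rm(sample_date)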

Two types of transformations


Static

  • Square root, log, inverse
  • Dummies for known levels
  • Date time extractions

Trained

  • Centering & scaling
  • Imputation
  • PCA
  • Anything for unknown factor levels

Trained methods need to calculate sufficient information to be applied again.
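
For example, step_normalize() is a trained step: when the recipe is estimated, it computes and stores the training-set mean and standard deviation, then reuses those same statistics on any new data. A minimal sketch (forested_test is assumed to be the matching test split):

norm_rec <- recipe(forested ~ elevation, data = forested_train) %>%
  step_normalize(elevation) %>%
  prep()  # trains the step: mean and sd are computed from forested_train

tidy(norm_rec, number = 1)                # inspect the stored mean and sd
bake(norm_rec, new_data = forested_test)  # applies the training-set statistics to new data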

The recipes package

  • Modular + extensible
  • Works well with pipes (|> and %>%)
  • Deferred evaluation
  • Isolates test data from training data
  • Can do things formulas can’t

How to write a recipe

forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())

How to write a recipe

forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())


Start by calling recipe() to denote the data source and variables used.

How to write a recipe

forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())


Specify what actions to take by adding step_*()s.

How to write a recipe

forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())


Use {tidyselect} and recipes-specific selectors to denote affected variables.

Using a recipe

forested_rec <- recipe(forested ~ ., data = forested_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(canopy_cover, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())


Save the recipe we like so that we can use it in various places, e.g., with different models.


Using a recipe with workflows

Recipes are typically combined with a model in a workflow() object:


forested_wflow <- workflow() %>%
  add_recipe(forested_rec) %>%
  add_model(logistic_reg())
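
Because the recipe is saved separately, the same forested_rec can be reused in another workflow with a different model (decision_tree() here is just an illustrative alternative):

tree_wflow <- workflow() %>%
  add_recipe(forested_rec) %>%
  add_model(decision_tree(mode = "classification"))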

Recipes are estimated

Every preprocessing step in a recipe that involves calculations uses the training set. For example:

  • Levels of a factor
  • Determination of zero-variance
  • Normalization
  • Feature extraction

Once a recipe is added to a workflow, this occurs when fit() is called.
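
A minimal sketch of that estimation step, using the workflow above:

# prep() runs on the recipe internally, using only the training set,
# before the model is fit to the processed data
forested_fit <- fit(forested_wflow, data = forested_train)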

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. forested_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
  • Another function, bake(), is analogous to predict(), and gives you the processed data back.

Your turn


Take the recipe and prep() then bake() it to see what the resulting data set looks like.

Try removing steps to see how the result changes.


05:00

Printing a recipe

forested_rec
#> 
#> ── Recipe ────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 18
#> 
#> ── Operations
#> • Dummy variables from: all_nominal_predictors()
#> • Zero variance filter on: all_predictors()
#> • Log transformation on: canopy_cover
#> • Centering and scaling for: all_numeric_predictors()

Prepping a recipe

prep(forested_rec)
#> 
#> ── Recipe ────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 18
#> 
#> ── Training information
#> Training data contained 8749 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variables from: tree_no_tree, land_type, county | Trained
#> • Zero variance filter removed: <none> | Trained
#> • Log transformation on: canopy_cover | Trained
#> • Centering and scaling for: year, elevation, ... | Trained

Baking a recipe

prep(forested_rec) %>%
  bake(new_data = forested_train)
#> # A tibble: 8,749 × 177
#>      year elevation eastness roughness dew_temp precip_annual temp_annual_mean temp_annual_min temp_annual_max temp_january_min vapor_min vapor_max canopy_cover     lon     lat forested
#>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>         <dbl>            <dbl>           <dbl>           <dbl>            <dbl>     <dbl>     <dbl>        <dbl>   <dbl>   <dbl> <fct>   
#>  1 -1.15    -0.521    1.13      0.0463  -0.232          0.178            0.172          -0.107           0.338         -0.00909    3.10     1.01          0.407  -1.44   -0.0184 Yes     
#>  2 -1.15     1.10    -1.50      3.91    -1.39          -0.568           -1.31           -1.33           -1.28          -1.28       0.0408  -0.941         0.652  -1.51    1.36   No      
#>  3  0.892   -0.0440   1.19      0.449   -0.422         -0.965           -0.305          -0.438          -0.170         -0.438     -0.363    0.140         0.884   0.331   0.556  Yes     
#>  4  0.810    0.493   -0.858    -0.518   -1.11          -0.577           -1.01           -1.15           -0.878         -1.10       0.272   -0.320         0.388   0.331   1.26   Yes     
#>  5 -1.31    -0.454   -1.35     -0.679    0.748         -0.379            0.701           0.893           0.683          0.702     -0.594    0.418         0.773  -0.0643 -0.992  Yes     
#>  6 -1.31    -0.193   -0.829    -0.357   -0.0546        -0.568            0.331           0.264           0.281          0.366      2.00     0.670         0.0802 -0.468  -0.237  No      
#>  7  0.892   -0.186    1.00      0.207   -0.631         -1.21            -0.444          -0.609          -0.264         -0.619     -0.0169   0.262         0.714   1.01    0.929  Yes     
#>  8 -1.15     0.702    1.34      0.369   -1.56           0.726           -1.77           -1.95           -1.60          -1.87      -1.23    -1.16          0.445  -1.77    2.03   Yes     
#>  9 -0.740    0.180   -1.34      0.288   -0.858         -0.729           -0.801          -1.14           -0.515         -1.08      -0.825   -0.00273       0.870   0.122   0.988  Yes     
#> 10  0.729   -0.722    0.0719   -0.276    0.816         -0.559            0.708           0.676           0.745          0.635     -0.536    0.521         0.744   1.22   -0.653  Yes     
#> # ℹ 8,739 more rows
#> # ℹ 161 more variables: tree_no_tree_No.tree <dbl>, land_type_Non.tree.vegetation <dbl>, land_type_Tree <dbl>, county_Atkinson <dbl>, county_Bacon <dbl>, county_Baker <dbl>, county_Baldwin <dbl>,
#> #   county_Banks <dbl>, county_Barrow <dbl>, county_Bartow <dbl>, county_Ben.Hill <dbl>, county_Berrien <dbl>, county_Bibb <dbl>, county_Bleckley <dbl>, county_Brantley <dbl>, county_Brooks <dbl>,
#> #   county_Bryan <dbl>, county_Bulloch <dbl>, county_Burke <dbl>, county_Butts <dbl>, county_Calhoun <dbl>, county_Camden <dbl>, county_Candler <dbl>, county_Carroll <dbl>, county_Catoosa <dbl>,
#> #   county_Charlton <dbl>, county_Chatham <dbl>, county_Chattahoochee <dbl>, county_Chattooga <dbl>, county_Cherokee <dbl>, county_Clarke <dbl>, county_Clay <dbl>, county_Clayton <dbl>,
#> #   county_Clinch <dbl>, county_Cobb <dbl>, county_Coffee <dbl>, county_Colquitt <dbl>, county_Columbia <dbl>, county_Cook <dbl>, county_Coweta <dbl>, county_Crawford <dbl>, county_Crisp <dbl>,
#> #   county_Dade <dbl>, county_Dawson <dbl>, county_Decatur <dbl>, county_DeKalb <dbl>, county_Dodge <dbl>, county_Dooly <dbl>, county_Dougherty <dbl>, county_Douglas <dbl>, county_Early <dbl>, …

Tidying a recipe

Once a recipe has been estimated, there are various bits of information saved in it.

  • The tidy() function can be used to get specific results from the recipe.

Your turn

Take a prepped recipe and use the tidy() function on it.

Use the number argument to inspect different steps.


05:00

Tidying a recipe

prep(forested_rec) %>%
  tidy()
#> # A tibble: 4 × 6
#>   number operation type      trained skip  id             
#>    <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
#> 1      1 step      dummy     TRUE    FALSE dummy_hIEnQ    
#> 2      2 step      zv        TRUE    FALSE zv_ZrQBx       
#> 3      3 step      log       TRUE    FALSE log_6es7X      
#> 4      4 step      normalize TRUE    FALSE normalize_XsIxb

Tidying a recipe

prep(forested_rec) %>%
  tidy(number = 1)
#> # A tibble: 161 × 3
#>    terms        columns             id         
#>    <chr>        <chr>               <chr>      
#>  1 tree_no_tree No tree             dummy_hIEnQ
#>  2 land_type    Non-tree vegetation dummy_hIEnQ
#>  3 land_type    Tree                dummy_hIEnQ
#>  4 county       Atkinson            dummy_hIEnQ
#>  5 county       Bacon               dummy_hIEnQ
#>  6 county       Baker               dummy_hIEnQ
#>  7 county       Baldwin             dummy_hIEnQ
#>  8 county       Banks               dummy_hIEnQ
#>  9 county       Barrow              dummy_hIEnQ
#> 10 county       Bartow              dummy_hIEnQ
#> # ℹ 151 more rows

Using a recipe in tidymodels

The recommended way to use a recipe in tidymodels is to use it as part of a workflow().

forested_wflow <- workflow() %>%  
  add_recipe(forested_rec) %>%  
  add_model(logistic_reg())

When used in this way, you don't need to worry about prep() and bake(), as they are handled for you.
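
A sketch of the full cycle (forested_test is assumed to be the matching test split): prep() happens inside fit(), and bake() happens inside predict().

forested_fit <- fit(forested_wflow, data = forested_train)  # the recipe is prepped here
predict(forested_fit, new_data = forested_test)             # new data is baked here, then predicted on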

More information