Extras - Recipes

Introduction to tidymodels

Looking at the predictors

taxi_train
#> # A tibble: 8,000 × 7
#>    tip   distance company                      local dow   month  hour
#>    <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
#>  1 yes      17.2  Chicago Independents         no    Thu   Feb      16
#>  2 yes       0.88 City Service                 yes   Thu   Mar       8
#>  3 yes      18.1  other                        no    Mon   Feb      18
#>  4 yes      12.2  Chicago Independents         no    Sun   Mar      21
#>  5 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
#>  6 yes      17.5  Flash Cab                    no    Fri   Mar      12
#>  7 yes      17.7  other                        no    Sun   Jan       6
#>  8 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
#>  9 yes       0.53 Sun Taxi                     no    Tue   Mar      18
#> 10 yes       6.65 Taxicab Insurance Agency Llc no    Sun   Apr      11
#> # ℹ 7,990 more rows

Working with other models

Some models can’t handle non-numeric data

  • Linear Regression
  • K Nearest Neighbors


Some models struggle if numeric predictors aren’t scaled

  • K Nearest Neighbors
  • Anything using gradient descent
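A minimal sketch of why scaling matters for distance-based models such as K Nearest Neighbors. The data and the `fare_cents` column are hypothetical and are not part of the taxi data.

```r
# Hypothetical trips: distance in miles, fare in cents (very different units)
trips <- data.frame(
  distance   = c(1, 2, 9, 5),
  fare_cents = c(1000, 1400, 1010, 3000)
)
rownames(trips) <- c("q", "a", "b", "d")

# Raw Euclidean distances: fare_cents dominates because its values are larger
raw_dist <- as.matrix(dist(trips))

# After centering and scaling, both columns contribute on a comparable scale
scaled_dist <- as.matrix(dist(scale(trips)))

# Compare the distances from trip "q" to trips "a" and "b"
raw_dist["q", c("a", "b")]
scaled_dist["q", c("a", "b")]
```

On the raw scale, "b" is the closest trip to "q" even though it is eight miles longer; after scaling, "a" is closer.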

Types of needed preprocessing

  • Do qualitative predictors require a numeric encoding?

  • Should columns with a single unique value be removed?

  • Does the model struggle with missing data?

  • Does the model struggle with correlated predictors?

  • Should predictors be centered and scaled?

  • Is it helpful to transform predictors to be more symmetric?

Two types of preprocessing

General definitions

  • Data preprocessing is the set of steps that you take to make your model successful.
  • Feature engineering is what you do to the original predictors so that the model has to do the least work to perform well.

Working with dates

Datetime variables are automatically converted to an integer if given as a raw predictor. To avoid this, they can be re-encoded as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Leap year
  • Indicators for holidays
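A minimal sketch of some of these encodings using recipes steps. The `pickup` column, the toy data, and the reference date are assumptions for illustration; the taxi data shown earlier has no raw date column.

```r
library(recipes)

# Hypothetical data with a raw Date column
trips <- data.frame(
  tip    = factor(c("yes", "no", "yes", "yes")),
  pickup = as.Date(c("2022-01-01", "2022-02-14", "2022-07-04", "2022-12-25"))
)

date_rec <- recipe(tip ~ pickup, data = trips) %>%
  # Days since a reference date (the reference is an arbitrary choice here)
  step_mutate(days_since = as.numeric(pickup - as.Date("2022-01-01"))) %>%
  # Day of the week, month, and year as separate columns
  step_date(pickup, features = c("dow", "month", "year")) %>%
  # 0/1 indicators for selected holidays
  step_holiday(pickup, holidays = c("NewYearsDay", "ChristmasDay")) %>%
  # Drop the raw date so that it is not passed to the model
  step_rm(pickup)

baked <- bake(prep(date_rec), new_data = NULL)
names(baked)
```

Which encodings are worth keeping depends on the modeling problem.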

Your turn


What other transformations could we do with the raw time variable?

Remember that the transformations are tied to the specific modeling problem.

Two types of transformations


Static

  • Square root, log, inverse
  • Dummies for known levels
  • Date time extractions

Trained

  • Centering & scaling
  • Imputation
  • PCA
  • Anything for unknown factor levels

Trained methods need to calculate sufficient information to be applied again.
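A minimal base R sketch of a trained transformation, assuming made-up numbers: the mean and standard deviation are estimated once from the training data, stored, and then reused on new data. The `normalize()` helper is hypothetical and is not a recipes function.

```r
train <- data.frame(distance = c(1, 2, 3, 4, 10))
test  <- data.frame(distance = c(2, 20))

# "Training": estimate and store the statistics from the training set only
stats <- list(mean = mean(train$distance), sd = sd(train$distance))

# "Applying": reuse the stored statistics on any data set
normalize <- function(x, stats) (x - stats$mean) / stats$sd

train_scaled <- normalize(train$distance, stats)
test_scaled  <- normalize(test$distance, stats)
```

The test set never contributes to `stats`, so the same raw value is transformed identically in both sets.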

The recipes package

  • Modular + extensible
  • Works well with pipes, |> and %>%
  • Deferred evaluation
  • Isolates test data from training data
  • Can do things formulas can’t

How to write a recipe

taxi_rec <- recipe(tip ~ ., data = taxi_train) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(distance, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())

Start by calling recipe() to denote the data source and variables used.

Specify what actions to take by adding step_*()s.

Use {tidyselect} and recipes-specific selectors to denote affected variables.

Using a recipe

taxi_rec <- recipe(tip ~ ., data = taxi_train) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_log(distance, offset = 0.5) %>%
  step_normalize(all_numeric_predictors())


Save the recipe we like so that we can use it in various places, e.g., with different models.


Using a recipe with workflows

Recipes are typically combined with a model in a workflow() object:


taxi_wflow <- workflow() %>%
  add_recipe(taxi_rec) %>%
  # tip is a two-level factor, so pair the recipe with a classification model
  add_model(logistic_reg())

Recipes are estimated

Every preprocessing step in a recipe that involves calculations uses the training set. For example:

  • Levels of a factor
  • Determination of zero-variance
  • Normalization
  • Feature extraction

Once a recipe is added to a workflow, this estimation occurs when fit() is called.
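A minimal sketch of that behavior, assuming a small simulated data set (`toy_train`) as a stand-in for taxi_train and a logistic regression as the model. The recipe's steps are untrained before fit() and trained afterwards.

```r
library(tidymodels)

set.seed(123)
# Hypothetical stand-in for taxi_train with the same kinds of columns
toy_train <- tibble(
  tip      = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  distance = runif(200, min = 0.5, max = 20),
  local    = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

toy_rec <- recipe(tip ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

toy_wflow <- workflow() %>%
  add_recipe(toy_rec) %>%
  add_model(logistic_reg())

# Before fitting, no step has been estimated
tidy(toy_rec)$trained

# fit() estimates the recipe on the training data, then fits the model
toy_fit <- fit(toy_wflow, data = toy_train)

# The recipe stored in the fitted workflow is now trained
trained_rec <- extract_recipe(toy_fit)
tidy(trained_rec)$trained
```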

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. taxi_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
  • Another function, bake(), is analogous to predict(), and gives you the processed data back.

Your turn


Take the recipe and prep() then bake() it to see what the resulting data set looks like.

Try removing steps to see how the result changes.


Printing a recipe

taxi_rec
#> 
#> ── Recipe ────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 6
#> 
#> ── Operations
#> • Unknown factor level assignment for: all_nominal_predictors()
#> • Dummy variables from: all_nominal_predictors()
#> • Zero variance filter on: all_predictors()
#> • Log transformation on: distance
#> • Centering and scaling for: all_numeric_predictors()

Prepping a recipe

prep(taxi_rec)
#> 
#> ── Recipe ────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 6
#> 
#> ── Training information
#> Training data contained 8000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Unknown factor level assignment for: company, ... | Trained
#> • Dummy variables from: company, local, dow, month | Trained
#> • Zero variance filter removed: company_unknown, ... | Trained
#> • Log transformation on: distance | Trained
#> • Centering and scaling for: distance and hour, ... | Trained

Baking a recipe

prep(taxi_rec) %>%
  bake(new_data = taxi_train)
#> # A tibble: 8,000 × 19
#>    distance   hour tip   company_City.Service company_Flash.Cab company_Sun.Taxi company_Taxi.Affiliatio…¹ company_Taxicab.Insu…² company_other local_no dow_Mon dow_Tue dow_Wed dow_Thu dow_Fri dow_Sat
#>       <dbl>  <dbl> <fct>                <dbl>             <dbl>            <dbl>                     <dbl>                  <dbl>         <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1    1.38   0.418 yes                 -0.366            -0.333           -0.403                    -0.450                 -0.379        -0.609    0.484  -0.396  -0.441  -0.461   2.01   -0.428  -0.317
#>  2   -0.729 -1.42  yes                  2.73             -0.333           -0.403                    -0.450                 -0.379        -0.609   -2.07   -0.396  -0.441  -0.461   2.01   -0.428  -0.317
#>  3    1.42   0.877 yes                 -0.366            -0.333           -0.403                    -0.450                 -0.379         1.64     0.484   2.53   -0.441  -0.461  -0.497  -0.428  -0.317
#>  4    1.11   1.57  yes                 -0.366            -0.333           -0.403                    -0.450                 -0.379        -0.609    0.484  -0.396  -0.441  -0.461  -0.497  -0.428  -0.317
#>  5   -0.694  2.03  yes                 -0.366            -0.333            2.48                     -0.450                 -0.379        -0.609   -2.07   -0.396  -0.441  -0.461  -0.497  -0.428   3.15 
#>  6    1.39  -0.502 yes                 -0.366             3.01            -0.403                    -0.450                 -0.379        -0.609    0.484  -0.396  -0.441  -0.461  -0.497   2.34   -0.317
#>  7    1.40  -1.88  yes                 -0.366            -0.333           -0.403                    -0.450                 -0.379         1.64     0.484  -0.396  -0.441  -0.461  -0.497  -0.428  -0.317
#>  8   -0.289 -0.502 yes                 -0.366            -0.333           -0.403                    -0.450                  2.64         -0.609    0.484  -0.396  -0.441  -0.461  -0.497   2.34   -0.317
#>  9   -0.971  0.877 yes                 -0.366            -0.333            2.48                     -0.450                 -0.379        -0.609    0.484  -0.396   2.27   -0.461  -0.497  -0.428  -0.317
#> 10    0.631 -0.732 yes                 -0.366            -0.333           -0.403                    -0.450                  2.64         -0.609    0.484  -0.396  -0.441  -0.461  -0.497  -0.428  -0.317
#> # ℹ 7,990 more rows
#> # ℹ abbreviated names: ¹​company_Taxi.Affiliation.Services, ²​company_Taxicab.Insurance.Agency.Llc
#> # ℹ 3 more variables: month_Feb <dbl>, month_Mar <dbl>, month_Apr <dbl>

Tidying a recipe

Once a recipe has been estimated, there are various bits of information saved in it.

  • The tidy() function can be used to get specific results from the recipe.

Your turn

Take a prepped recipe and use the tidy() function on it.

Use the number argument to inspect different steps.


Tidying a recipe

prep(taxi_rec) %>%
  tidy()
#> # A tibble: 5 × 6
#>   number operation type      trained skip  id             
#>    <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
#> 1      1 step      unknown   TRUE    FALSE unknown_NTmu5  
#> 2      2 step      dummy     TRUE    FALSE dummy_cT3Uy    
#> 3      3 step      zv        TRUE    FALSE zv_z22dk       
#> 4      4 step      log       TRUE    FALSE log_QQ1iw      
#> 5      5 step      normalize TRUE    FALSE normalize_3cTJb

Tidying a recipe

prep(taxi_rec) %>%
  tidy(number = 2)
#> # A tibble: 20 × 3
#>    terms   columns                      id         
#>    <chr>   <chr>                        <chr>      
#>  1 company City Service                 dummy_cT3Uy
#>  2 company Flash Cab                    dummy_cT3Uy
#>  3 company Sun Taxi                     dummy_cT3Uy
#>  4 company Taxi Affiliation Services    dummy_cT3Uy
#>  5 company Taxicab Insurance Agency Llc dummy_cT3Uy
#>  6 company other                        dummy_cT3Uy
#>  7 company unknown                      dummy_cT3Uy
#>  8 local   no                           dummy_cT3Uy
#>  9 local   unknown                      dummy_cT3Uy
#> 10 dow     Mon                          dummy_cT3Uy
#> 11 dow     Tue                          dummy_cT3Uy
#> 12 dow     Wed                          dummy_cT3Uy
#> 13 dow     Thu                          dummy_cT3Uy
#> 14 dow     Fri                          dummy_cT3Uy
#> 15 dow     Sat                          dummy_cT3Uy
#> 16 dow     unknown                      dummy_cT3Uy
#> 17 month   Feb                          dummy_cT3Uy
#> 18 month   Mar                          dummy_cT3Uy
#> 19 month   Apr                          dummy_cT3Uy
#> 20 month   unknown                      dummy_cT3Uy
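The same idea applies to other steps. As a minimal sketch on made-up data (not the taxi data), tidying a normalization step returns the means and standard deviations that were estimated from the training set.

```r
library(recipes)

# Hypothetical toy data
toy <- data.frame(
  tip      = factor(c("yes", "no", "yes", "no")),
  distance = c(1, 2, 3, 6),
  hour     = c(8, 12, 16, 20)
)

toy_prepped <- recipe(tip ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# One row per predictor and statistic (mean and sd)
norm_stats <- tidy(toy_prepped, number = 1)
norm_stats
```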

Using a recipe in tidymodels

The recommended way to use a recipe in tidymodels is to use it as part of a workflow().

taxi_wflow <- workflow() %>%
  add_recipe(taxi_rec) %>%
  # tip is a two-level factor, so pair the recipe with a classification model
  add_model(logistic_reg())

When used in this way, you don’t need to worry about prep() and bake(), as they are handled for you.
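A minimal sketch of this, assuming simulated data (`toy`) in place of the taxi data and a logistic regression as the model. The fitted workflow takes raw, unprocessed rows in predict() and applies the trained recipe internally.

```r
library(tidymodels)

set.seed(456)
n <- 200
# Hypothetical data; tip depends loosely on distance
toy <- tibble(
  distance = runif(n, min = 0.5, max = 20),
  local    = factor(sample(c("yes", "no"), n, replace = TRUE)),
  tip      = factor(ifelse(distance + rnorm(n, sd = 5) > 10, "yes", "no"))
)
toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

toy_wflow <- workflow() %>%
  add_recipe(
    recipe(tip ~ ., data = toy_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg())

# fit() preps the recipe on toy_train and fits the model
toy_fit <- fit(toy_wflow, data = toy_train)

# predict() bakes toy_test with the trained recipe before predicting
preds <- predict(toy_fit, new_data = toy_test)
preds
```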

More information