5 - Feature engineering

Machine learning with tidymodels

Working with our predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for lm()).
  • The model needs certain data qualities (e.g. same units for K-NN).
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a “feature engineering”).

The first two reasons are fairly predictable (next page).

The last one depends on your modeling problem.

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model.

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • PCA feature extraction

There are a lot of examples in Feature Engineering and Selection.

Example: Dates

How can we represent date columns for our model?

When a date column is used in its native format, it is usually converted by an R model to an integer.

It can be re-engineered as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
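
Several of these can be created directly with recipe steps (the recipes package is introduced below). A minimal sketch, assuming a hypothetical data frame games with a date column game_date:

library(recipes)

date_rec <- 
  recipe(~ game_date, data = games) %>%   # `games` and `game_date` are hypothetical
  # day of the week, month, and year as new columns
  step_date(game_date, features = c("dow", "month", "year")) %>% 
  # 0/1 indicators for specific holidays (names from timeDate::listHolidays())
  step_holiday(game_date, holidays = c("USThanksgivingDay", "ChristmasDay")) %>% 
  # drop the original date column once the features are made
  step_rm(game_date)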

General definitions

  • Data preprocessing steps allow your model to fit.

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible.

The recipes package can handle both!

In a little bit, we’ll see successful (and unsuccessful) feature engineering methods for our example data.

The NHL data 🏒

  • From Pittsburgh Penguins games, 8,915 shots

  • Data from the 2015-2016 season

Let’s predict whether a shot is on-goal (a goal, or saved by the goaltender) or not.

Case study

library(tidymodels)
library(ongoal)

tidymodels_prefer()

glimpse(season_2015)
#> Rows: 8,915
#> Columns: 13
#> $ on_goal         <fct> no, no, yes, no, yes, no, no, yes, no, no, no, yes, no…
#> $ period          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ game_seconds    <dbl> 13, 36, 47, 92, 99, 125, 179, 220, 252, 309, 413, 427,…
#> $ strength        <fct> even, even, even, even, even, even, even, even, even, …
#> $ home_skaters    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ away_skaters    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ goaltender      <fct> marc_andre_fleury, antti_niemi, antti_niemi, antti_nie…
#> $ goal_difference <dbl> 0, 0, 0, 0, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1,…
#> $ shooter         <fct> evgeni_malkin, valeri_nichushkin, phil_kessel, beau_be…
#> $ shooter_type    <fct> center, right_wing, center, right_wing, center, defens…
#> $ coord_x         <dbl> -66, -49, 64, 65, 80, 42, -55, 62, -67, -58, 76, 56, -…
#> $ coord_y         <dbl> -11, -21, -31, -21, 13, 31, -19, 15, -9, -16, -8, 25, …
#> $ extra_attacker  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Data splitting strategy

Why a validation set?

Recall that resampling gives us performance measures without using the test set.

It’s important to get good resampling statistics (e.g. R²).

  • That usually means having enough data to estimate performance.

When you have “a lot” of data, a validation set can be an efficient way to do this.

Splitting the NHL data

set.seed(23)
nhl_split <- initial_split(season_2015, prop = 3/4)
nhl_split
#> <Training/Testing/Total>
#> <6686/2229/8915>

nhl_train_and_val <- training(nhl_split)
nhl_test  <- testing(nhl_split)

## training + validation
nrow(nhl_train_and_val)
#> [1] 6686
 
## testing
nrow(nhl_test)
#> [1] 2229

Validation split

Since there are a lot of observations, we’ll use a validation set:

set.seed(234)
nhl_val <- validation_split(nhl_train_and_val, prop = 0.80)
nhl_val
#> # Validation Set Split (0.8/0.2)  
#> # A tibble: 1 × 2
#>   splits              id        
#>   <list>              <chr>     
#> 1 <split [5348/1338]> validation

Remember that a validation split is a type of resample.

Your turn

Let’s explore the training set data.

Use the function plot_nhl_shots() for nice spatial plots of the data.

nhl_train <- analysis(nhl_val$splits[[1]])

set.seed(100)
nhl_train %>% 
  sample_n(200) %>%
  plot_nhl_shots(emphasis = shooter_type)


Prepare your data for modeling

  • The recipes package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
  • The resulting processed output can be used as inputs for statistical or machine learning models.

A first recipe

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train)

  • The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.

A first recipe

summary(nhl_rec)
#> # A tibble: 13 × 4
#>    variable        type      role      source  
#>    <chr>           <list>    <chr>     <chr>   
#>  1 period          <chr [2]> predictor original
#>  2 game_seconds    <chr [2]> predictor original
#>  3 strength        <chr [3]> predictor original
#>  4 home_skaters    <chr [2]> predictor original
#>  5 away_skaters    <chr [2]> predictor original
#>  6 goaltender      <chr [3]> predictor original
#>  7 goal_difference <chr [2]> predictor original
#>  8 shooter         <chr [3]> predictor original
#>  9 shooter_type    <chr [3]> predictor original
#> 10 coord_x         <chr [2]> predictor original
#> 11 coord_y         <chr [2]> predictor original
#> 12 extra_attacker  <chr [2]> predictor original
#> 13 on_goal         <chr [3]> outcome   original

Create indicator variables

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors())

  • For any factor or character predictors, make binary indicators.

  • There are many recipe steps that can convert categorical predictors to numeric columns.

Filter out constant columns

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

In case a factor level was never observed in the training data (resulting in an indicator column of all 0s), we can delete any zero-variance predictors, i.e. columns with a single unique value.

Normalization

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())

  • This centers and scales the numeric predictors.

  • The recipe will use the training set to estimate the means and standard deviations of the data.

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
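
To illustrate the last point, here is a sketch using prep() and bake() (both are covered at the end of this section): the test set is processed with statistics estimated from the training set only.

norm_fit <- prep(nhl_rec)             # estimates the means/SDs from nhl_train
bake(norm_fit, new_data = nhl_test)   # applies those statistics; no re-estimation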

Reduce correlation

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, remove the smallest set of columns needed to make all remaining pairwise correlations less than the threshold.
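
As a sketch, tidy() on the prepped recipe should list which columns the correlation filter removed (step_corr is the fourth step in this recipe):

corr_fit <- prep(nhl_rec)
tidy(corr_fit, number = 4)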

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

PCA feature extraction…

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  embed::step_umap(all_numeric_predictors(), outcome = vars(on_goal))

A fancy machine learning supervised dimension reduction technique…

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_ns(coord_y, coord_x, deg_free = 10)

Nonlinear transforms like natural splines, and so on!

Your turn

Create a recipe() for the on-goal data to:

  • create one-hot indicator variables
  • remove zero-variance variables

Minimal recipe

nhl_indicators <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

Using a workflow

set.seed(9)

nhl_glm_wflow <-
  workflow() %>%
  add_recipe(nhl_indicators) %>%
  add_model(logistic_reg())
 
ctrl <- control_resamples(save_pred = TRUE)
nhl_glm_res <-
  nhl_glm_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_glm_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.756     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.753     1      NA Preprocessor1_Model1

Your turn

Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results.


Holdout predictions

# Since we used `save_pred = TRUE`
glm_val_pred <- collect_predictions(nhl_glm_res)
glm_val_pred %>% slice(1:7)
#> # A tibble: 7 × 7
#>   id             .pred_yes .pred_no  .row .pred_class on_goal .config           
#>   <chr>              <dbl>    <dbl> <int> <fct>       <fct>   <chr>             
#> 1 validation 0.00844          0.992     2 no          no      Preprocessor1_Mod…
#> 2 validation 0.0864           0.914    10 no          no      Preprocessor1_Mod…
#> 3 validation 0.799            0.201    22 yes         yes     Preprocessor1_Mod…
#> 4 validation 0.615            0.385    24 yes         yes     Preprocessor1_Mod…
#> 5 validation 0.787            0.213    31 yes         yes     Preprocessor1_Mod…
#> 6 validation 0.872            0.128    39 yes         yes     Preprocessor1_Mod…
#> 7 validation 0.00000000903    1.00     40 no          no      Preprocessor1_Mod…

Two class data

Let’s say we can define one class as the “event”, like a shot being on goal.

  • The sensitivity is the true positive rate (accuracy on actual events).

  • The specificity is the true negative rate (accuracy on actual non-events, or 1 - false positive rate).

Two class data

These definitions assume that we know the threshold for converting “soft” probability predictions into “hard” class predictions.

Is a 50% threshold good?

What happens if we say that we need to be 80% sure to declare an event?

  • sensitivity ⬇️, specificity ⬆️

What happens for a 20% threshold?

  • sensitivity ⬆️, specificity ⬇️

Varying the threshold
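
A sketch of this tradeoff, reusing the glm_val_pred predictions from above and yardstick’s vector functions, recomputes hard class predictions at 50% and 80% thresholds:

glm_val_pred %>% 
  mutate(
    # hard class calls at two probability thresholds
    .pred_50 = factor(if_else(.pred_yes >= 0.50, "yes", "no"), levels = levels(on_goal)),
    .pred_80 = factor(if_else(.pred_yes >= 0.80, "yes", "no"), levels = levels(on_goal))
  ) %>% 
  summarise(
    sens_50 = sens_vec(on_goal, .pred_50), spec_50 = spec_vec(on_goal, .pred_50),
    sens_80 = sens_vec(on_goal, .pred_80), spec_80 = spec_vec(on_goal, .pred_80)
  )

Raising the threshold should lower sensitivity and raise specificity, matching the arrows above.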

ROC curves

To make an ROC (receiver operating characteristic) curve, we:

  • calculate the sensitivity and specificity for all possible thresholds

  • plot false positive rate (x-axis) versus true positive rate (y-axis)

We can use the area under the ROC curve as a classification metric:

  • ROC AUC = 1 💯
  • ROC AUC = 1/2 😢

ROC curves

# Assumes _first_ factor level is event; there are options to change that
roc_curve_points <- glm_val_pred %>% roc_curve(truth = on_goal, .pred_yes)
roc_curve_points %>% slice(1, 50, 100)
#> # A tibble: 3 × 3
#>   .threshold specificity sensitivity
#>        <dbl>       <dbl>       <dbl>
#> 1 -Inf            0            1    
#> 2    0.00143      0.0412       0.968
#> 3    0.00988      0.127        0.968

glm_val_pred %>% roc_auc(truth = on_goal, .pred_yes)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.753

ROC curve plot

autoplot(roc_curve_points)

Your turn

Compute and plot an ROC curve for your current model.

What data are being used for this ROC curve plot?


What do we do with the player data? 🏒

There are 598 unique player values in our training set. How can we include this information in our model?

We could:

  • make the full set of indicator variables 😳

  • lump players who rarely shoot into an “other” group

  • use feature hashing to create a smaller set of indicator variables

  • use effect encoding to replace the shooter column with the estimated effect of that predictor

Let’s look at othering first, then effect encodings.

Per-player statistics

Collapsing factor levels

There is a recipe step that will redefine factor levels based on their frequency in the training set:

nhl_other_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  # Any player with <= 0.1% of shots is set to "other"
  step_other(shooter, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

Using this code, 402 players (out of 598) were collapsed into “other” based on the training set.

We could try to optimize the threshold for collapsing (see the next set of slides on model tuning).

Does othering help?

nhl_other_wflow <-
  nhl_glm_wflow %>%
  update_recipe(nhl_other_rec)

nhl_other_res <-
  nhl_other_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_other_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.778     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.804     1      NA Preprocessor1_Model1

A little better ROC AUC and much faster to complete.

Now let’s look at a more sophisticated tool called effect encodings.

What is an effect encoding?

We replace the qualitative predictor’s values with their estimated effect on the outcome.

Data before:

before
#> # A tibble: 7 × 3
#>   on_goal shooter        .row
#>   <fct>   <fct>         <int>
#> 1 yes     sidney_crosby     1
#> 2 yes     mike_hoffman      2
#> 3 yes     ian_cole          3
#> 4 yes     matt_cullen       4
#> 5 yes     kris_letang       5
#> 6 no      nazem_kadri       6
#> 7 yes     david_perron      7

Data after:

after
#> # A tibble: 7 × 3
#>   on_goal shooter  .row
#>   <fct>     <dbl> <int>
#> 1 yes       0.220     1
#> 2 yes       0.302     2
#> 3 yes       0.261     3
#> 4 yes       0.463     4
#> 5 yes       0.197     5
#> 6 no        0.302     6
#> 7 yes       0.400     7

The shooter column is replaced with the log-odds of being on goal.
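
Since the encoding is on the log-odds scale, plogis() converts an encoded value back to a probability:

plogis(0.220)   # the first encoded value above; ~0.555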

Per-player statistics again

  • Good statistical methods for estimating these rates use partial pooling.

  • Pooling borrows strength across players and shrinks extreme values (e.g. zero or one) towards the mean for players with very few shots.

  • The embed package has recipe steps for effect encodings.

Partial pooling

Player effects

library(embed)

nhl_effect_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_lencode_mixed(shooter, goaltender, outcome = vars(on_goal)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.

Recipes are estimated

Preprocessing steps in a recipe use the training set to compute quantities.

What kind of quantities are computed for preprocessing?

  • Levels of a factor
  • Whether a column has zero variance
  • Normalization statistics (means and standard deviations)
  • Feature extraction results (e.g. PCA loadings)
  • Effect encodings

When a recipe is part of a workflow, this estimation occurs when fit() is called.
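
A sketch, using the workflow from earlier: this one call preps the recipe on nhl_train and then fits the logistic regression on the processed data.

nhl_glm_fit <- fit(nhl_glm_wflow, data = nhl_train)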

Effect encoding results

nhl_effect_wflow <-
  nhl_glm_wflow %>%
  update_recipe(nhl_effect_rec)

nhl_effect_res <-
  nhl_effect_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_effect_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.791     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.805     1      NA Preprocessor1_Model1

Better and it can handle new players (if they occur).

Where is the shot coming from? 🏒🧐

Angle

nhl_angle_rec <-
  nhl_effect_rec %>%
  step_mutate(
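    # shot angle in degrees, measured relative to the goal at x = 89 (standard rink coordinates)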
    angle = abs( atan2(abs(coord_y), (89 - coord_x) ) * (180 / pi) )
  )

Shot from the defensive zone

nhl_zone_rec <-
  nhl_angle_rec %>%
  step_mutate(
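    # flag shots taken from the defensive zone, behind the blue line at x = -25.5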
    defensive_zone = ifelse(coord_x <= -25.5, 1, 0)
  )

Behind goal line

nhl_behind_rec <-
  nhl_zone_rec %>%
  step_mutate(
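    # flag shots taken from behind the goal line at x = 89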
    behind_goal_line = ifelse(coord_x >= 89, 1, 0)
  )

Fit different recipes

A workflow set can cross models and/or preprocessors and then resample them en masse.

no_coord_rec <- 
  nhl_indicators %>% 
  step_rm(starts_with("coord"))

set.seed(9)

nhl_glm_set_res <-
  workflow_set(
    list(`1_no_coord` = no_coord_rec,   `2_other` = nhl_other_rec, 
         `3_effects`  = nhl_effect_rec, `4_angle` = nhl_angle_rec, 
         `5_zone`     = nhl_zone_rec,   `6_bgl`   = nhl_behind_rec),
    list(logistic = logistic_reg())
  ) %>%
  workflow_map(fn = "fit_resamples", resamples = nhl_val, verbose = TRUE, control = ctrl)

Your turn

Create a workflow set with 2 or 3 recipes.

(Consider using recipes we’ve already created.)

Use workflow_map() to resample the workflow set.


Compare recipes

library(forcats)
collect_metrics(nhl_glm_set_res) %>%
  filter(.metric == "roc_auc") %>%
  mutate(
    features = gsub("_logistic", "", wflow_id), 
    features = fct_reorder(features, mean)
  ) %>%
  ggplot(aes(x = mean, y = features)) +
  geom_point(size = 3) +
  labs(y = NULL, x = "ROC AUC (validation set)")
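
The workflowsets package also has helpers for tabulating these results; a sketch using rank_results():

# rank the workflows by their validation-set ROC AUC
rank_results(nhl_glm_set_res, rank_metric = "roc_auc") %>% 
  filter(.metric == "roc_auc") %>% 
  select(wflow_id, mean, rank)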

Compare recipes

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. nhl_effect_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
  • Another function (bake()) is analogous to predict(), and gives you the processed data back.
  • The tidy() function can be used to get specific results from the recipe.

Example

nhl_angle_fit <- prep(nhl_angle_rec)

tidy(nhl_angle_fit, number = 1) %>% slice(1:4)
#> # A tibble: 4 × 4
#>   level           value terms   id                 
#>   <chr>           <dbl> <chr>   <chr>              
#> 1 aaron_ekblad    0.250 shooter lencode_mixed_qBdjK
#> 2 adam_clendening 0.202 shooter lencode_mixed_qBdjK
#> 3 adam_cracknell  0.288 shooter lencode_mixed_qBdjK
#> 4 adam_henrique   0.221 shooter lencode_mixed_qBdjK

bake(nhl_angle_fit, nhl_train %>% slice(1:3), starts_with("coord"), angle, shooter)
#> # A tibble: 3 × 4
#>   coord_x coord_y angle shooter
#>     <dbl>   <dbl> <dbl>   <dbl>
#> 1      67     -37  59.3   0.220
#> 2      55      -7  11.6   0.302
#> 3      42     -18  21.0   0.261

More on recipes

  • Once fit() is called on a workflow, changing the model does not re-fit the recipe.
  • Some steps can be skipped when using predict() (via their skip argument).
  • The order of the steps matters; for example, normalizing before creating dummy variables means the new 0/1 indicator columns are not normalized.