5 - Feature engineering

Machine learning with tidymodels

Working with our predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for lm()).
  • The model needs certain data qualities (e.g. same units for K-NN).
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a. “feature engineering”).

The first two reasons are fairly predictable (next page).

The last one depends on your modeling problem.

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model.

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • PCA feature extraction

There are a lot of examples in Feature Engineering and Selection.
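As a hedged sketch, here is what recipe steps for these kinds of representations can look like; the data frame dat and the predictors y, x1, and x2 are hypothetical, not part of the example data used later:

library(recipes)

rec <- 
  recipe(y ~ x1 + x2, data = dat) %>% 
  step_interact(~ x1:x2) %>%                        # interaction term
  step_ns(x1, deg_free = 5) %>%                     # natural spline expansion
  step_pca(all_numeric_predictors(), num_comp = 2)  # PCA feature extraction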

Example: Dates

How can we represent date columns for our model?

When a date column is used in its native format, it is usually converted by an R model to an integer.

It can be re-engineered as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
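If the raw data contain a date column, recipes has steps for exactly these re-engineered features. A hedged sketch, where game_date and shot_data are hypothetical (the NHL data below already comes with dow, month, and year columns):

date_rec <- 
  recipe(on_goal ~ game_date, data = shot_data) %>% 
  step_date(game_date, features = c("dow", "month", "year")) %>%  # day of week, month, year
  step_holiday(game_date) %>%                                     # indicator columns for holidays
  step_rm(game_date)                                              # drop the original date column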

General definitions

  • Data preprocessing steps allow your model to fit.

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible.

The recipes package can handle both!

In a little bit, we’ll see successful (and unsuccessful) feature engineering methods for our example data.

The NHL data 🏒

  • From Pittsburgh Penguins games, 12,147 shots

  • Data from the 2015-2016 season

Let’s predict whether a shot is on-goal (a goal or blocked by the goaltender) or not.

Case study

library(tidymodels)
library(ongoal)

tidymodels_prefer()

glimpse(season_2015)
#> Rows: 12,147
#> Columns: 17
#> $ on_goal           <fct> yes, no, no, yes, no, no, yes, no, yes, no, no, no, …
#> $ period            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ period_type       <fct> regular, regular, regular, regular, regular, regular…
#> $ coord_x           <dbl> -53, 68, -42, -77, -67, 55, 77, 62, 59, 76, 44, 62, …
#> $ coord_y           <dbl> -18, -12, -18, 9, -5, -12, 13, 14, -5, -6, 7, -2, -2…
#> $ game_time         <dbl> 0.300000, 0.900000, 1.250000, 1.783333, 2.050000, 3.…
#> $ strength          <fct> even, even, even, even, even, even, even, even, even…
#> $ player            <fct> victor_hedman, evgeni_malkin, jason_garrison, ondrej…
#> $ player_diff       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ offense_team      <fct> TBL, PIT, TBL, TBL, TBL, PIT, PIT, PIT, PIT, PIT, PI…
#> $ defense_team      <fct> PIT, TBL, PIT, PIT, PIT, TBL, TBL, TBL, TBL, TBL, TB…
#> $ offense_goal_diff <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ game_type         <fct> playoff, playoff, playoff, playoff, playoff, playoff…
#> $ position          <fct> defenseman, center, defenseman, left_wing, defensema…
#> $ dow               <fct> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sa…
#> $ month             <fct> May, May, May, May, May, May, May, May, May, May, Ma…
#> $ year              <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…

Data splitting strategy

Why a validation set?

Recall that resampling gives us performance measures without using the test set.

It’s important to get good resampling statistics (e.g. R²).

  • That usually means having enough data to estimate performance.

When you have “a lot” of data, a validation set can be an efficient way to do this.

Splitting the NHL data

set.seed(23)
nhl_split <- initial_split(season_2015, prop = 3/4)
nhl_split
#> <Training/Testing/Total>
#> <9110/3037/12147>

nhl_train_and_val <- training(nhl_split)
nhl_test  <- testing(nhl_split)

## not testing
nrow(nhl_train_and_val)
#> [1] 9110
 
## testing
nrow(nhl_test)
#> [1] 3037

Validation split

Since there are a lot of observations, we’ll use a validation set:

set.seed(234)
nhl_val <- validation_split(nhl_train_and_val, prop = 0.80)
nhl_val
#> # Validation Set Split (0.8/0.2)  
#> # A tibble: 1 × 2
#>   splits              id        
#>   <list>              <chr>     
#> 1 <split [7288/1822]> validation

Remember that a validation split is a type of resample.

Your turn

Let’s explore the training set data.

Use the function plot_nhl_shots() for nice spatial plots of the data.

nhl_train <- analysis(nhl_val$splits[[1]])

set.seed(100)
nhl_train %>% 
  sample_n(200) %>%
  plot_nhl_shots(emphasis = position)


Prepare your data for modeling

  • The recipes package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
  • The resulting processed output can be used as inputs for statistical or machine learning models.

A first recipe

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train)
  • The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.

A first recipe

summary(nhl_rec)
#> # A tibble: 17 × 4
#>    variable          type    role      source  
#>    <chr>             <chr>   <chr>     <chr>   
#>  1 period            numeric predictor original
#>  2 period_type       nominal predictor original
#>  3 coord_x           numeric predictor original
#>  4 coord_y           numeric predictor original
#>  5 game_time         numeric predictor original
#>  6 strength          nominal predictor original
#>  7 player            nominal predictor original
#>  8 player_diff       numeric predictor original
#>  9 offense_team      nominal predictor original
#> 10 defense_team      nominal predictor original
#> 11 offense_goal_diff numeric predictor original
#> 12 game_type         nominal predictor original
#> 13 position          nominal predictor original
#> 14 dow               nominal predictor original
#> 15 month             nominal predictor original
#> 16 year              numeric predictor original
#> 17 on_goal           nominal outcome   original

Create indicator variables

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors())
  • For any factor or character predictors, make binary indicators.

  • There are many recipe steps that can convert categorical predictors to numeric columns.
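For example, step_dummy() has a one_hot option, and infrequent factor levels can be collapsed first with step_other(); a hedged sketch (the 0.01 threshold is just illustrative):

recipe(on_goal ~ ., data = nhl_train) %>% 
  step_other(all_nominal_predictors(), threshold = 0.01) %>%  # pool rare levels into "other"
  step_dummy(all_nominal_predictors(), one_hot = TRUE)        # one indicator column per level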

Filter out constant columns

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

In case there is a factor level that was never observed in the training data (resulting in an indicator column of all 0s), we remove any zero-variance predictors, i.e. columns that contain only a single unique value.

Normalization

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())
  • This centers and scales the numeric predictors.

  • The recipe will use the training set to estimate the means and standard deviations of the data.

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).

Reduce correlation

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, step_corr() removes the minimum number of columns needed so that all remaining pairwise correlations are below the threshold.

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

PCA feature extraction…

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  embed::step_umap(all_numeric_predictors(), outcome = on_goal)

A fancy machine learning supervised dimension reduction technique…

Other possible steps

nhl_rec <- 
  recipe(on_goal ~ ., data = nhl_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_ns(coord_y, coord_x, deg_free = 10)

Nonlinear transforms like natural splines, and so on!

Your turn

Create a recipe() for the on-goal data to:

  • create one-hot indicator variables
  • remove zero-variance variables

Minimal recipe

nhl_indicators <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

Using a workflow

set.seed(9)

nhl_glm_wflow <-
  workflow() %>%
  add_recipe(nhl_indicators) %>%
  add_model(logistic_reg())
 
ctrl <- control_resamples(save_pred = TRUE)
nhl_glm_res <-
  nhl_glm_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_glm_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.555     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.558     1      NA Preprocessor1_Model1

Your turn

Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results.


Holdout predictions

# Since we used `save_pred = TRUE`
glm_val_pred <- collect_predictions(nhl_glm_res)
glm_val_pred %>% slice(1:7)
#> # A tibble: 7 × 7
#>   id         .pred_yes .pred_no  .row .pred_class on_goal .config             
#>   <chr>          <dbl>    <dbl> <int> <fct>       <fct>   <chr>               
#> 1 validation     0.198 8.02e- 1    10 no          no      Preprocessor1_Model1
#> 2 validation     0.264 7.36e- 1    17 no          yes     Preprocessor1_Model1
#> 3 validation     0.189 8.11e- 1    23 no          no      Preprocessor1_Model1
#> 4 validation     1.00  8.39e-11    40 yes         yes     Preprocessor1_Model1
#> 5 validation     0.322 6.78e- 1    41 no          yes     Preprocessor1_Model1
#> 6 validation     1.00  8.39e-11    46 yes         no      Preprocessor1_Model1
#> 7 validation     0.354 6.46e- 1    55 no          no      Preprocessor1_Model1

Two class data

Let’s say we can define one class as the “event”, like a shot being on goal.

  • The sensitivity is the true positive rate (accuracy on actual events).

  • The specificity is the true negative rate (accuracy on actual non-events, or 1 - false positive rate).

Two class data

These definitions assume that we know the threshold for converting “soft” probability predictions into “hard” class predictions.

Is a 50% threshold good?

What happens if we say that we need to be 80% sure to declare an event?

  • sensitivity ⬇️, specificity ⬆️

What happens for a 20% threshold?

  • sensitivity ⬆️, specificity ⬇️
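A minimal sketch of this trade-off, using the glm_val_pred holdout predictions shown earlier and an illustrative 80% threshold (.pred_80 is just a made-up column name):

glm_val_pred %>% 
  mutate(
    # hard class call at an 80% threshold instead of the default 50%
    .pred_80 = factor(if_else(.pred_yes >= 0.80, "yes", "no"),
                      levels = levels(on_goal))
  ) %>% 
  summarise(
    sens_80 = sens_vec(on_goal, .pred_80),  # true positive rate
    spec_80 = spec_vec(on_goal, .pred_80)   # true negative rate
  )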

ROC curves

To make an ROC (receiver operating characteristic) curve, we:

  • calculate the sensitivity and specificity for all possible thresholds

  • plot false positive rate (x-axis) versus true positive rate (y-axis)

We can use the area under the ROC curve as a classification metric:

  • ROC AUC = 1 💯
  • ROC AUC = 1/2 😢

ROC curves

# Assumes _first_ factor level is event; there are options to change that
roc_curve_points <- glm_val_pred %>% roc_curve(truth = on_goal, estimate = .pred_yes)
roc_curve_points %>% slice(1, 50, 100)
#> # A tibble: 3 × 3
#>   .threshold specificity sensitivity
#>        <dbl>       <dbl>       <dbl>
#> 1   -Inf          0            1    
#> 2      0.139      0.0303       0.977
#> 3      0.272      0.0642       0.955

glm_val_pred %>% roc_auc(truth = on_goal, estimate = .pred_yes)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.558

ROC curve plot

autoplot(roc_curve_points)

Your turn

Compute and plot an ROC curve for your current model.


Your turn

What data are being used for this ROC curve plot?


What do we do with the player data? 🏒

There are 597 unique player values in our training set. How can we include this information in our model?

We could:

  • make the full set of indicator variables 😳

  • use feature hashing to create a smaller set of indicator variables

  • use effect encoding to replace the player column with the estimated effect of that predictor

Let’s use an effect encoding.

What is an effect encoding?

We replace the qualitative predictor’s data with its effect on the outcome.

Data before:

before
#> # A tibble: 7 × 3
#>   on_goal player            .row
#>   <fct>   <fct>            <int>
#> 1 yes     brian_dumoulin       1
#> 2 yes     patric_hornqvist     2
#> 3 yes     nikita_nesterov      3
#> 4 yes     jack_eichel          4
#> 5 yes     justin_williams      5
#> 6 yes     seth_jones           6
#> 7 yes     kris_letang          7

Data after:

after
#> # A tibble: 7 × 3
#>   on_goal  player  .row
#>   <fct>     <dbl> <int>
#> 1 yes     -0.114      1
#> 2 yes      0.631      2
#> 3 yes      0.142      3
#> 4 yes      0.220      4
#> 5 yes      0.248      5
#> 6 yes      0.0733     6
#> 7 yes      0.0774     7

The player column is replaced with the log-odds of being on goal.
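For intuition, here is a hedged sketch of the raw (unpooled) per-player log-odds; the values produced by the effect encoding step below are shrunken versions of these:

nhl_train %>% 
  group_by(player) %>% 
  summarise(
    shots = n(),
    rate  = mean(on_goal == "yes"),
    raw_log_odds = log(rate / (1 - rate))  # +/-Inf for players whose shots are all (or never) on goal
  ) %>% 
  arrange(shots)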

Per-player statistics

  • Good statistical methods for estimating these rates use partial pooling.

  • Pooling borrows strength across players and shrinks extreme values (e.g. zero or one) towards the mean for players with very few shots.

  • The embed package has recipe steps for effect encodings.

Partial pooling

Player effects

library(embed)

nhl_effect_rec <-
  recipe(on_goal ~ ., data = nhl_train) %>%
  step_lencode_mixed(player, outcome = vars(on_goal)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.

Recipes are estimated

Preprocessing steps in a recipe use the training set to compute quantities.

What kind of quantities are computed for preprocessing?

  • Levels of a factor
  • Whether a column has zero variance
  • Means and standard deviations for normalization
  • Components for feature extraction (e.g. PCA)
  • Effect encodings

When a recipe is part of a workflow, this estimation occurs when fit() is called.
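As a minimal sketch (assuming the nhl_glm_wflow and nhl_train objects from earlier slides), a single fit() call estimates the recipe and then fits the model, and the estimated recipe can be pulled back out of the result:

nhl_glm_fit <- fit(nhl_glm_wflow, data = nhl_train)  # recipe estimation + model fit in one call
extract_recipe(nhl_glm_fit)                          # the estimated (trained) recipe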

Effect encoding results

nhl_effect_wflow <-
  nhl_glm_wflow %>%
  update_recipe(nhl_effect_rec)

nhl_effect_res <-
  nhl_effect_wflow %>%
  fit_resamples(nhl_val, control = ctrl)

collect_metrics(nhl_effect_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.540     1      NA Preprocessor1_Model1
#> 2 roc_auc  binary     0.551     1      NA Preprocessor1_Model1

Where is the shot coming from? 🏒🧐

Angle

nhl_angle_rec <-
  nhl_indicators %>%
  step_mutate(
    angle = abs(atan2(abs(coord_y), (89 - abs(coord_x))) * (180 / pi))
  )

Distance

nhl_distance_rec <-
  nhl_angle_rec %>%
  step_mutate(
    distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2),
    distance = log(distance)
  )

Behind goal line

nhl_behind_rec <-
  nhl_distance_rec %>%
  step_mutate(
    behind_goal_line = ifelse(abs(coord_x) >= 89, 1, 0)
  )

Fit different recipes

A workflow set can cross models and/or preprocessors and then resample them en masse.

set.seed(9)

nhl_glm_set_res <-
  workflow_set(
    list(`1_dummy` = nhl_indicators, `2_angle` = nhl_angle_rec, 
         `3_dist` = nhl_distance_rec, `4_bgl` = nhl_behind_rec),
    list(logistic = logistic_reg())
  ) %>%
  workflow_map(fn = "fit_resamples", resamples = nhl_val, verbose = TRUE, control = ctrl)

Your turn

Create a workflow set with 2 or 3 recipes.

(Consider using recipes we’ve already created.)

Use workflow_map() to resample the workflow set.


Compare recipes

library(forcats)
collect_metrics(nhl_glm_set_res) %>%
  filter(.metric == "roc_auc") %>%
  mutate(
    features = gsub("_logistic", "", wflow_id), 
    features = fct_reorder(features, mean)
  ) %>%
  ggplot(aes(x = mean, y = features)) +
  geom_point(size = 3) +
  labs(y = NULL, x = "ROC AUC (validation set)")
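workflowsets also has rank_results() for a quick tabular comparison; a sketch, assuming the default metrics computed above:

rank_results(nhl_glm_set_res, rank_metric = "roc_auc") %>% 
  filter(.metric == "roc_auc") %>% 
  select(wflow_id, mean, std_err, rank)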


Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. nhl_effect_rec) can be estimated manually with a function called prep(). It is analogous to fit().
  • Another function (bake()) is analogous to predict(), and gives you the processed data back.
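A minimal sketch, assuming the nhl_effect_rec recipe and the data splits from earlier:

nhl_effect_prepped <- prep(nhl_effect_rec)        # estimate the steps, analogous to fit()

bake(nhl_effect_prepped, new_data = NULL)         # the processed training set
bake(nhl_effect_prepped, new_data = nhl_test)     # apply the same estimated steps to new data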

More on recipes

  • Once fit() is called on a workflow, changing the model does not re-fit the recipe.
  • Some steps can be skipped when using predict().
  • The order of the steps matters.