Extras - Effect Encodings

Advanced tidymodels

Previously - Setup

library(tidymodels)
library(textrecipes)
library(bonsai)

# Max's usual settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates |> 
  sample_n(5000) |> 
  arrange(arrival_date) |> 
  select(-arrival_date) |> 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Previously - Data Usage

set.seed(4028)
hotel_split <-
  initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)

What do we do with the agent and company data?

There are 98 unique agent values and 100 companies in our training set. How can we include this information in our model?

We could:

make the full set of indicator variables 😳
lump agents and companies that rarely occur into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the agent and company columns with the estimated effect of that predictor
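As a rough sketch of the second option, rare levels can be lumped with `step_other()` from recipes. The toy data frame and the 20% threshold below are assumptions chosen for illustration, not values taken from the hotel data.

```r
library(recipes)

# Hypothetical toy data: one frequent agent ("a") and three rare ones
toy_lump <- data.frame(
  price = c(50, 52, 51, 49, 80, 95, 60, 55),
  agent = factor(c("a", "a", "a", "a", "b", "c", "a", "d"))
)

lump_rec <-
  recipe(price ~ agent, data = toy_lump) |>
  # Levels that make up less than 20% of the training rows become "other"
  step_other(agent, threshold = 0.2) |>
  prep()

# Processed version of the training data
lumped <- bake(lump_rec, new_data = NULL)
table(lumped$agent)
```

Here agents "b", "c", and "d" each appear in one of eight rows, so all three are pooled into a single "other" level.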

Per-agent statistics

What is an effect encoding?

We replace the qualitative predictor’s data with its estimated effect on the outcome.

Data before:

before
#> # A tibble: 7 × 3
#>   avg_price_per_room agent            .row
#>                <dbl> <fct>           <int>
#> 1               52.7 cynthia_worsley     1
#> 2               51.8 carlos_bryant       2
#> 3               53.8 lance_hitchcock     3
#> 4               51.8 lance_hitchcock     4
#> 5               46.8 cynthia_worsley     5
#> 6               54.7 charles_najera      6
#> 7               46.8 cynthia_worsley     7

Data after:

after
#> # A tibble: 7 × 3
#>   avg_price_per_room agent  .row
#>                <dbl> <dbl> <int>
#> 1               52.7  88.5     1
#> 2               51.8  89.5     2
#> 3               53.8  79.8     3
#> 4               51.8  79.8     4
#> 5               46.8  88.5     5
#> 6               54.7 109.      6
#> 7               46.8  88.5     7

The agent column is replaced with an estimate of the average daily rate (ADR) for that agent.
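To make the idea concrete, here is a minimal base R sketch of the simplest possible effect encoding: a plain per-agent mean with no pooling. It uses only the seven rows printed above (the name `toy_before` is hypothetical), so the numbers will not match the "after" table, whose estimates came from the full training set.

```r
# The seven rows shown in the "before" printout
toy_before <- data.frame(
  avg_price_per_room = c(52.7, 51.8, 53.8, 51.8, 46.8, 54.7, 46.8),
  agent = c(
    "cynthia_worsley", "carlos_bryant", "lance_hitchcock", "lance_hitchcock",
    "cynthia_worsley", "charles_najera", "cynthia_worsley"
  )
)

# ave() returns, for each row, the mean outcome of that row's agent
toy_after <- toy_before
toy_after$agent <- ave(toy_before$avg_price_per_room, toy_before$agent)
toy_after
```

Both `lance_hitchcock` rows receive the same value, the mean of 53.8 and 51.8. Agents with a single booking simply get their own price back, which is the overfitting risk that pooling is meant to reduce.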

Per-agent statistics again

Good statistical methods for estimating these means use partial pooling.
Pooling borrows strength across agents and shrinks extreme values towards the mean for agents with very few transactions.
The embed package has recipe steps for effect encodings.
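The shrinkage idea can be sketched by hand. The snippet below is a simplified illustration, not what embed does internally: it assumes the overall mean and the ratio of within-agent to between-agent variance (`k`) are known, whereas a mixed model estimates them from the data. All numbers are hypothetical.

```r
# Hypothetical per-agent summaries: number of bookings and raw mean price
toy_agents <- data.frame(
  agent = c("frequent", "moderate", "rare"),
  n     = c(200, 20, 1),
  mean  = c(90, 120, 250)
)

overall_mean <- 100  # assumed grand mean
k <- 10              # assumed within-agent variance / between-agent variance

# The weight on an agent's own mean grows with its number of bookings
w <- toy_agents$n / (toy_agents$n + k)

# Pooled estimate: a weighted average of the agent's mean and the grand mean
toy_agents$pooled <- w * toy_agents$mean + (1 - w) * overall_mean
toy_agents
```

The agent with a single booking is pulled most of the way back to the overall mean, while the agent with 200 bookings barely moves.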

Partial pooling

Agent effects

library(embed)

hotel_effect_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) |> 
  step_YeoJohnson(lead_time) |>
  step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.

Effect encoding results

hotel_effect_wflow <-
  workflow() |>
  add_model(linear_reg()) |> 
  add_recipe(hotel_effect_rec)

reg_metrics <- metric_set(mae, rsq)

hotel_effect_res <-
  hotel_effect_wflow |>
  fit_resamples(hotel_rs, metrics = reg_metrics)

collect_metrics(hotel_effect_res)
#> # A tibble: 2 × 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 mae     standard   17.8      10 0.189   Preprocessor1_Model1
#> 2 rsq     standard    0.870    10 0.00357 Preprocessor1_Model1

Performance is slightly worse, but the encoding can handle new agents (if they occur).
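The handling of new agents can be checked on a small example. This is a sketch on simulated toy data (the names `toy_train`, `agent_shift`, and `toy_new` are hypothetical), and it assumes the embed and lme4 packages are installed.

```r
library(recipes)
library(embed)  # step_lencode_mixed() also needs the lme4 package installed

set.seed(1)
# Simulated data: six agents, each shifting the price by a fixed amount
toy_train <- data.frame(agent = factor(rep(letters[1:6], each = 20)))
agent_shift <- c(a = -10, b = -5, c = 0, d = 5, e = 10, f = 20)
toy_train$price <-
  100 + unname(agent_shift[as.character(toy_train$agent)]) + rnorm(120, sd = 5)

toy_rec <-
  recipe(price ~ agent, data = toy_train) |>
  step_lencode_mixed(agent, outcome = vars(price)) |>
  prep()

# One row per training level, plus a "..new" row reserved for unseen levels
toy_enc <- tidy(toy_rec, number = 1)
toy_enc

# An agent that never appeared in training still receives a numeric value
toy_new <- data.frame(agent = factor("z"), price = NA_real_)
toy_baked <- bake(toy_rec, new_data = toy_new)
toy_baked
```

The `..new` row in the tidy output holds the value assigned to any level that was not present when the encoding was estimated, so prediction does not fail on an unfamiliar agent.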