Advanced tidymodels
There are 98 unique agent values and 100 companies in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables 😳
lump agents and companies that rarely occur into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the agent and company columns with the estimated effect of that predictor
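As a rough illustration of the lumping option, the sketch below uses `step_other()` from the recipes package on made-up data. The `toy` data frame, its `adr` column, and the 5% threshold are assumptions chosen for the example and are not taken from the hotel data.

```r
library(recipes)

# Hypothetical toy data: agents "c" and "d" are rare (3% and 2% of rows).
toy <- data.frame(
  agent = factor(c(rep("a", 50), rep("b", 45), rep("c", 3), rep("d", 2))),
  adr   = seq(80, 179)
)

# Pool every agent that makes up less than 5% of the training rows
# into a single "other" level (threshold is an assumed value).
lump_rec <-
  recipe(adr ~ agent, data = toy) %>%
  step_other(agent, threshold = 0.05)

lumped <- bake(prep(lump_rec), new_data = NULL)

# Count rows per level after lumping.
table(lumped$agent)
```

The threshold controls the trade-off: a larger value produces fewer indicator columns but hides more agent-level detail.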
We replace the qualitative predictor's data with its estimated effect on the outcome.
Data before:
The agent column is replaced with an estimate of the ADR.
Good statistical methods for estimating these means use partial pooling.
Partial pooling borrows strength across agents and shrinks extreme values towards the overall mean for agents with very few transactions.
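To make the shrinkage idea concrete, here is a minimal base R sketch on made-up data. It uses a simple weighted average with a hand-picked constant `k`, which is an assumption for illustration only. A real partial pooling method (for example, a mixed model) estimates the amount of shrinkage from the data instead.

```r
# Hypothetical toy data: agent "c" has only two bookings, both very expensive.
bookings <- data.frame(
  agent = c(rep("a", 50), rep("b", 40), "c", "c"),
  adr   = c(rep(100, 50), rep(120, 40), 300, 320)
)

grand_mean <- mean(bookings$adr)
agent_mean <- tapply(bookings$adr, bookings$agent, mean)
agent_n    <- tapply(bookings$adr, bookings$agent, length)

# Assumed shrinkage constant: agents with few bookings get a small weight,
# so their estimate is pulled strongly towards the grand mean.
k <- 10
weight <- agent_n / (agent_n + k)

shrunk <- data.frame(
  agent  = names(agent_mean),
  n      = as.vector(agent_n),
  raw    = as.vector(agent_mean),
  pooled = as.vector(weight * agent_mean + (1 - weight) * grand_mean)
)

shrunk
```

In this sketch the estimate for agent "c" moves a long way from its raw mean, while agents "a" and "b", which have many bookings, barely change.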
The embed package has recipe steps for effect encodings.
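A recipe using one of these steps might look like the sketch below, which applies `step_lencode_mixed()` to simulated data. The `toy` data frame, the `adr` outcome name, and the simulated agent effects are all assumptions for the example. The step fits a mixed model, so the lme4 package needs to be installed.

```r
library(dplyr)    # for vars()
library(recipes)
library(embed)    # step_lencode_mixed() requires lme4 to be installed

# Hypothetical simulated data: eight agents, each with its own price effect.
set.seed(1)
agents <- paste0("agent_", 1:8)
agent_effect <- setNames(rnorm(8, mean = 0, sd = 20), agents)

toy <- data.frame(agent = factor(sample(agents, 200, replace = TRUE)))
toy$adr <- 100 + unname(agent_effect[as.character(toy$agent)]) +
  rnorm(200, sd = 5)

# Replace the agent factor with a partially pooled estimate of its effect.
effect_rec <-
  recipe(adr ~ agent, data = toy) %>%
  step_lencode_mixed(agent, outcome = vars(adr))

encoded <- bake(prep(effect_rec), new_data = NULL)

# The agent column is now numeric rather than a factor.
head(encoded)
```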
It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.
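# The new workflow has no recipe yet, so add one with add_recipe()
# (update_recipe() is meant for replacing an existing recipe).
hotel_effect_wflow <-
  workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(hotel_effect_rec)

reg_metrics <- metric_set(mae, rsq)

# The encoding is re-estimated within each resample, so the metrics
# reflect the effect encoding step as well as the model.
hotel_effect_res <-
  hotel_effect_wflow %>%
  fit_resamples(hotel_rs, metrics = reg_metrics)

collect_metrics(hotel_effect_res)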
#> # A tibble: 2 × 6
#>   .metric .estimator   mean     n std_err .config
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>
#> 1 mae     standard   17.8      10 0.236   Preprocessor1_Model1
#> 2 rsq     standard    0.867    10 0.00377 Preprocessor1_Model1
Performance is slightly worse, but this model can handle new agents (if they occur).