Advanced tidymodels
We might want to modify our predictor columns for a few reasons:
The first two reasons are fairly predictable (next page).
The last one depends on your modeling problem.
Think of a feature as some representation of a predictor that will be used in a model.
Example representations:
There are a lot of examples in Feature Engineering and Selection (FES).
How can we represent date columns for our model?
When we use a date column in its native format, most models in R convert it to an integer.
We can re-engineer it as:
Data preprocessing steps allow your model to fit.
Feature engineering steps help the model do the least work to predict the outcome as well as possible.
The recipes package can handle both!
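As a minimal sketch of re-engineering a date column with recipes: the toy data frame and its column names (arrival_date, price) are invented for illustration and are not the hotel data. step_date() derives day-of-week, month, and year features from the date column.

```r
library(recipes)

# Toy data: 60 consecutive days with a made-up price
toy <- data.frame(
  arrival_date = as.Date("2023-01-01") + 0:59,
  price = seq(80, 139)
)

date_rec <-
  recipe(price ~ arrival_date, data = toy) %>%
  # add day-of-week, month, and year columns derived from the date
  step_date(arrival_date, features = c("dow", "month", "year"))

date_features <- date_rec %>% prep() %>% bake(new_data = NULL)
names(date_features)
```

The new columns are named after the original column with a suffix per feature, so they can be selected later for steps such as indicator encoding.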
We’ll use data on hotels to predict the cost of a room.
The data are in the modeldatatoo package. We’ll sample down the data and refactor some columns:
Let’s split the data into a training set (75%) and testing set (25%):
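A minimal sketch of a 75/25 split with rsample, shown on simulated data. The data frame toy_rates, the seed, and the choice to stratify on the outcome are assumptions for illustration; the real split of the hotel data may differ in all three.

```r
library(rsample)

set.seed(1)  # arbitrary seed, chosen only for this sketch
toy_rates <- data.frame(
  lead_time = rpois(200, lambda = 30),
  avg_price_per_room = rnorm(200, mean = 100, sd = 20)
)

# prop = 0.75 keeps about 75% of the rows for training
toy_split <- initial_split(toy_rates, prop = 0.75, strata = avg_price_per_room)
toy_tr <- training(toy_split)
toy_te <- testing(toy_split)

c(training = nrow(toy_tr), testing = nrow(toy_te))
```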
Let’s take some time and investigate the training data. The outcome is avg_price_per_room.
Are there any interesting characteristics of the data?
We’ll use simple 10-fold cross-validation (stratified sampling):
set.seed(472)
hotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)
hotel_rs
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3372/377]> Fold01
#> 2 <split [3373/376]> Fold02
#> 3 <split [3373/376]> Fold03
#> 4 <split [3373/376]> Fold04
#> 5 <split [3373/376]> Fold05
#> 6 <split [3374/375]> Fold06
#> 7 <split [3375/374]> Fold07
#> 8 <split [3376/373]> Fold08
#> 9 <split [3376/373]> Fold09
#> 10 <split [3376/373]> Fold10
The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.
summary(hotel_rec)
#> # A tibble: 28 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 lead_time <chr [2]> predictor original
#> 2 arrival_date_day_of_month <chr [2]> predictor original
#> 3 stays_in_weekend_nights <chr [2]> predictor original
#> 4 stays_in_week_nights <chr [2]> predictor original
#> 5 adults <chr [2]> predictor original
#> 6 children <chr [2]> predictor original
#> 7 babies <chr [2]> predictor original
#> 8 meal <chr [3]> predictor original
#> 9 country <chr [3]> predictor original
#> 10 market_segment <chr [3]> predictor original
#> # ℹ 18 more rows
The type column contains information on the variables.
What do you think are in the type vectors for the lead_time and country columns?
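One way to check an answer is to pull individual entries out of the summary. This is a sketch on a toy data frame, with made-up values, that borrows the column names from the hotel data; the exact contents of the type entries depend on the installed recipes version, so no output is shown.

```r
library(recipes)

# Toy stand-in for the hotel data (values are invented)
toy <- data.frame(
  lead_time = c(3, 40, 12, 7),
  country = factor(c("PRT", "GBR", "PRT", "ESP")),
  avg_price_per_room = c(80, 120, 95, 60)
)

# Everything left of ~ is an outcome; everything on the right is a predictor
toy_rec <- recipe(avg_price_per_room ~ ., data = toy)
rec_info <- summary(toy_rec)

# Inspect the type entries for one numeric and one factor column
rec_info$type[rec_info$variable == "lead_time"]
rec_info$type[rec_info$variable == "country"]
```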
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
step_dummy() records the levels of the categorical predictors in the training set.
In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
This centers and scales the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.
PCA feature extraction…
A fancy machine learning supervised dimension reduction technique…
Nonlinear transforms like natural splines, and so on!
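The first few operations described above can be chained in one recipe. This is a sketch only: the built-in iris data stands in for the hotel data, and the 0.9 correlation threshold is an arbitrary choice for illustration. The PCA, supervised dimension reduction, and spline steps are left out to keep it short.

```r
library(recipes)

# iris stands in for the hotel data; Sepal.Length plays the role of the outcome
toy_rec <-
  recipe(Sepal.Length ~ ., data = iris) %>%
  # binary indicators for factor or character predictors
  step_dummy(all_nominal_predictors()) %>%
  # drop predictors that have a single unique value
  step_zv(all_predictors()) %>%
  # center and scale using means and standard deviations from the training set
  step_normalize(all_numeric_predictors()) %>%
  # remove predictors until pairwise absolute correlations are below the threshold
  step_corr(all_numeric_predictors(), threshold = 0.9)

toy_processed <- toy_rec %>% prep() %>% bake(new_data = NULL)
names(toy_processed)
```

The order matters here: the indicator columns must exist before the numeric selectors can pick them up for normalization and correlation filtering.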
Create a recipe() for the hotel data to:
lead_time
We’ll compute two measures: mean absolute error (MAE) and the coefficient of determination (a.k.a. \(R^2\)).
\[\begin{align} MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \notag \\ R^2 &= cor(y_i, \hat{y}_i)^2 \end{align}\]
The focus will be on MAE for parameter optimization. We’ll use a metric set to compute these:
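A metric set for these two measures could look like the sketch below. The object name reg_metrics matches the one passed to fit_resamples() later in these slides, but the definition shown here is an assumption, and the toy observed and predicted values are invented.

```r
library(yardstick)

# Bundle MAE and R^2 into one function that computes both
reg_metrics <- metric_set(mae, rsq)

# Toy observed and predicted values, invented for illustration
toy_pred <- data.frame(
  truth    = c(40, 54, 50, 42, 49),
  estimate = c(45, 50, 52, 41, 47)
)

toy_metrics <- reg_metrics(toy_pred, truth = truth, estimate = estimate)
toy_metrics
```

Because mae is listed first, it is the first row of the result, which is also the metric that tuning functions use by default when picking a metric to optimize.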
set.seed(9)
hotel_lm_wflow <-
  workflow() %>%
  add_recipe(hotel_indicators) %>%
  add_model(linear_reg())
ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <-
  hotel_lm_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_lm_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 17.3 10 0.199 Preprocessor1_Model1
#> 2 rsq standard 0.874 10 0.00400 Preprocessor1_Model1
Use fit_resamples() to fit your workflow with a recipe.
Collect the predictions from the results.
# Since we used `save_pred = TRUE`
lm_val_pred <- collect_predictions(hotel_lm_res)
lm_val_pred %>% slice(1:7)
#> # A tibble: 7 × 5
#> id .pred .row avg_price_per_room .config
#> <chr> <dbl> <int> <dbl> <chr>
#> 1 Fold01 62.1 20 40 Preprocessor1_Model1
#> 2 Fold01 48.0 28 54 Preprocessor1_Model1
#> 3 Fold01 64.6 45 50 Preprocessor1_Model1
#> 4 Fold01 45.8 49 42 Preprocessor1_Model1
#> 5 Fold01 45.8 61 49 Preprocessor1_Model1
#> 6 Fold01 30.0 66 40 Preprocessor1_Model1
#> 7 Fold01 38.8 88 49 Preprocessor1_Model1
There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables 😳
lump agents and companies that rarely occur into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the agent and company columns with the estimated effect of that predictor (in the extra materials)
There is a recipe step that will redefine factor levels based on their frequency in the training set:
Using this code, 34 agents (out of 98) were collapsed into “other” based on the training set.
We could try to optimize the threshold for collapsing (see the next set of slides on model tuning).
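step_other() in recipes is one step that collapses infrequent factor levels. The sketch below uses a made-up agent column and an arbitrary 5% threshold, so it will not reproduce the count of 34 collapsed agents quoted above; it only shows the mechanics.

```r
library(recipes)

# Toy data: two common agents, one moderately common, four rare ones
toy_agents <- data.frame(
  agent = factor(c(rep("a", 50), rep("b", 40), rep("c", 6), "d", "e", "f", "g")),
  avg_price_per_room = seq(60, 159)
)

other_rec <-
  recipe(avg_price_per_room ~ agent, data = toy_agents) %>%
  # pool levels seen in fewer than 5% of training rows into "other"
  step_other(agent, threshold = 0.05)

toy_collapsed <- other_rec %>% prep() %>% bake(new_data = NULL)
table(toy_collapsed$agent)
```

The pooled levels are determined from the training set when the recipe is estimated, and the same mapping is then applied to any new data.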
hotel_other_wflow <-
  hotel_lm_wflow %>%
  update_recipe(hotel_other_rec)
hotel_other_res <-
  hotel_other_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_other_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 17.4 10 0.205 Preprocessor1_Model1
#> 2 rsq standard 0.874 10 0.00417 Preprocessor1_Model1
About the same MAE and much faster to complete.
Now let’s look at a more sophisticated tool called feature hashing.
Between agent and company, simple dummy variables would create 198 new columns (that are mostly zeros).
Another option is to have a binary indicator that combines some levels of these variables.
Feature hashing (for more see FES, SMLTAR, and TMwR):
Suppose we want to use 32 indicator variables for agent.
For an agent with value “Max_Kuhn”, a hashing function converts it to an integer (say 210397726).
To assign it to one of the 32 columns, we would use modular arithmetic:
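For the example integer above, the arithmetic can be sketched as follows. The variable names are only for illustration, and whether an implementation then shifts the result to a 1-based column index is a convention not covered here.

```r
hash_value <- 210397726  # the example integer from the text
n_columns <- 32

# remainder after dividing by the number of columns
column <- hash_value %% n_columns
column
#> [1] 30
```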
Hash functions are meant to emulate randomness.
The textrecipes package has a step that can be added to the recipe:
library(textrecipes)
hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_tr) %>%
  step_YeoJohnson(lead_time) %>%
  # Defaults to 32 signed indicator columns
  step_dummy_hash(agent) %>%
  step_dummy_hash(company) %>%
  # Regular indicators for the others
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
hotel_hash_wflow <-
  hotel_lm_wflow %>%
  update_recipe(hash_rec)
hotel_hash_res <-
  hotel_hash_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_hash_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 17.5 10 0.256 Preprocessor1_Model1
#> 2 rsq standard 0.872 10 0.00395 Preprocessor1_Model1
About the same performance but now we can handle new values.
A recipe (e.g., hash_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
bake() is analogous to predict(), and gives you the processed data back.
The tidy() function can be used to get specific results from the recipe.
Once fit() is called on a workflow, changing the model does not re-fit the recipe.
… predict().
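The manual route through prep(), bake(), and tidy() can be sketched on a small recipe. The built-in iris data stands in for the hotel training set here, and a single normalization step is used only because its estimates are easy to read.

```r
library(recipes)

# iris stands in for the hotel training set
toy_rec <-
  recipe(Sepal.Length ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the recipe from the training data, much like fit()
toy_prepped <- prep(toy_rec)

# bake() returns processed data, much like predict();
# new_data = NULL gives back the processed training set
toy_baked <- bake(toy_prepped, new_data = NULL)

# tidy() on step 1 returns the estimated means and standard deviations
toy_stats <- tidy(toy_prepped, number = 1)
toy_stats
```

Inside a workflow these calls happen automatically, so running them by hand is mainly useful when a recipe step fails or produces unexpected columns.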