Getting More Out of Feature Engineering and Tuning for Machine Learning
# Load the packages and example data for this section
library(tidymodels)
library(important)  # supervised feature selection steps for recipes
library(tailor)     # post-processing adjustments

"https://raw.githubusercontent.com/tidymodels/" |>
  paste0("workshops/main/slides/class_data.RData") |>
  url() |>
  load()

# Split the data, stratifying on the class outcome
set.seed(429)
sim_split <- initial_split(class_data, prop = 0.75, strata = class)
sim_train <- training(sim_split)
sim_test  <- testing(sim_split)

# 10-fold cross-validation on the training set
set.seed(523)
sim_rs <- vfold_cv(sim_train, v = 10, strata = class)
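For orientation (this check is not part of the original code), the training set keeps roughly three quarters of the rows:

# About 1,500 of the 2,000 rows; 30 predictors plus the class outcome
dim(sim_train)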
Some models, such as glmnet, automatically remove predictors by never using them in the model. Sometimes using irrelevant predictors hurts model performance. There are two general approaches to supervised feature selection:

- wrappers: a sequential algorithm proposes feature subsets, fits the model with these subsets, and then determines a better subset from the results.
- filters: screen predictors before adding them to the model.

tidymodels doesn’t have any wrappers (but see the caret documentation for them). The new important package does have filters via recipes.
tidymodels has always contained some “hidden guardrails” that should prevent practitioners from making subtle (but consequential) methodological mistakes.
Feature selection is a good example: the literature shows that it is easy to do incorrectly. The selection process should take place inside a resampling loop so that the workflow does not overfit the predictors. We released two packages this year that enable supervised feature selection.
Let’s look at the help page for important::step_predictor_best().
rec <-
  recipe(class ~ ., data = sim_train) |>
  # Keep the best-scoring predictors using random forest importance scores;
  # the proportion of predictors to retain is tuned
  step_predictor_best(
    all_predictors(),
    score = "imp_rf",
    prop_terms = tune(),
    id = "filter"
  ) |>
  step_normalize(all_numeric_predictors())
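As a quick check (not part of the original code), we can list the tuning parameters declared so far in the recipe:

# The filter step contributes prop_terms, the proportion of predictors to keep
rec |> extract_parameter_set_dials()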
# k-nearest neighbors model; the number of neighbors and the distance
# weighting function are tuned
knn_spec <-
  nearest_neighbor(neighbors = tune(), weight_func = tune()) |>
  set_mode("classification")

# Post-processor that tunes the probability threshold used to make hard
# class predictions
thrsh_tlr <-
  tailor() |>
  adjust_probability_threshold(threshold = tune())
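The tuning step that produces knn_res (used below) is not shown in this excerpt. A minimal sketch, assuming a grid search over the combined workflow; the names knn_wflow and cls_mtr, the seed, and the grid size are assumptions (cls_mtr matches the metrics reported later):

# Combine the recipe, model, and tailor into a single workflow
knn_wflow <-
  workflow() |>
  add_recipe(rec) |>
  add_model(knn_spec) |>
  add_tailor(thrsh_tlr)

# Metrics used for tuning and, later, for the test set via last_fit()
cls_mtr <- metric_set(sensitivity, specificity, brier_class, roc_auc)

# Tune the filter, the model, and the probability threshold together
set.seed(17)
knn_res <-
  knn_wflow |>
  tune_grid(
    resamples = sim_rs,
    grid = 25,
    metrics = cls_mtr
  )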
# Fit the model with the numerically best parameters (by the Brier score)
knn_fit <- fit_best(knn_res, metric = "brier_class")

# Which predictors did the filter remove?
filter_info <-
  knn_fit |>
  extract_recipe() |>
  tidy(id = "filter")

filter_info
#> # A tibble: 30 × 4
#> terms removed score id
#> <chr> <lgl> <dbl> <chr>
#> 1 predictor_01 TRUE 0.00152 filter
#> 2 predictor_02 TRUE 0.00170 filter
#> 3 predictor_03 TRUE -0.0000460 filter
#> 4 predictor_04 TRUE 0.000968 filter
#> 5 predictor_05 TRUE -0.0000626 filter
#> 6 predictor_06 TRUE 0.000126 filter
#> 7 predictor_07 TRUE 0.000779 filter
#> 8 predictor_08 TRUE 0.00160 filter
#> 9 predictor_09 TRUE 0.00218 filter
#> 10 predictor_10 TRUE -0.000302 filter
#> # ℹ 20 more rows
The data were simulated and 15 out of 30 predictors were uninformative (and highly correlated). How did we do?
|         | noise | real |
|---------|-------|------|
| kept    | 0     | 2    |
| removed | 15    | 13   |
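A cross-tabulation like this one can be built from filter_info; a small sketch, assuming a character vector real_predictors (not defined in this excerpt) that names the 15 informative columns:

# Compare the filter's decisions to the known simulation truth
filter_info |>
  mutate(status = if_else(terms %in% real_predictors, "real", "noise")) |>
  count(status, removed)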
The filter was good at removing noise but not at keeping the real predictors.

The simulation system is documented here with method = "caret". The two predictors that were retained correspond to the most important effects in the simulation. Most of the others have small linear effects, and tree-based models are not great at modeling those.

Also, the noise predictors were simulated to have fairly high correlations with one another. That can often compromise random forest importance scores.
The important package has two other feature selection steps that can be used with multiple scores:

- step_predictor_retain(): choose predictors based on a logical statement.
- step_predictor_desirability(): choose multiple scores to compute, then use desirability functions to rank them.

We already have our fitted model and, if we are happy with it, we can look at the resampling estimates:
#> # A tibble: 4 × 4
#> .metric mean n std_err
#> <chr> <dbl> <int> <dbl>
#> 1 sensitivity 0.958 10 0.0154
#> 2 specificity 0.867 10 0.00948
#> 3 brier_class 0.0349 10 0.00201
#> 4 roc_auc 0.968 10 0.00714
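These estimates can be pulled from the tuning results; a sketch, assuming the knn_res object from the tuning sketch above:

# Resampling estimates for the numerically best parameters (by Brier score)
best_config <- select_best(knn_res, metric = "brier_class")

collect_metrics(knn_res) |>
  semi_join(best_config, by = ".config")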
Similar to fit_best(), there is a convenience function that can be used to get the final model and the test set results. We have to start with a finalized workflow (i.e., no tune() values).
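The finalization step itself is not shown in this excerpt; a minimal sketch, assuming the knn_wflow and knn_res objects from the tuning sketch above:

# Plug the numerically best parameter values back into the workflow so that
# no tune() placeholders remain
knn_last_wflow <-
  knn_wflow |>
  finalize_workflow(select_best(knn_res, metric = "brier_class"))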
last_fit() uses the original split object to fit, predict, and measure the model using the test set:
knn_test_res <-
  knn_last_wflow |>
  last_fit(sim_split, metrics = cls_mtr)

knn_test_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [1499/501]> train/test split <tibble> <tibble> <tibble> <workflow>
We can pick out the parts that we want:
knn_final_fit <- knn_test_res |> extract_workflow()
knn_test_pred <- knn_test_res |> collect_predictions()
knn_test_mtr <- knn_test_res |> collect_metrics()
knn_test_mtr
#> # A tibble: 4 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 sensitivity binary 0.982 pre0_mod0_post0
#> 2 specificity binary 0.874 pre0_mod0_post0
#> 3 brier_class binary 0.0246 pre0_mod0_post0
#> 4 roc_auc binary 0.981 pre0_mod0_post0
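From here, the extracted workflow can be used on new samples; a small illustration (reusing a few test set rows purely for demonstration):

# Predict classes for a handful of rows; the fitted tailor applies the tuned
# probability threshold when the hard class predictions are generated
predict(knn_final_fit, new_data = sim_test |> slice(1:5))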
Easy peasy!