Machine learning with tidymodels
Several years’ worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (“L”) train station in Chicago.
More information: several chapters in Feature Engineering and Selection. The data are in the modeldata package; see ?Chicago.
Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.
library(tidymodels)
library(rules)
data("Chicago")
dim(Chicago)
#> [1] 5698 50
stations
#> [1] "Austin" "Quincy_Wells" "Belmont" "Archer_35th"
#> [5] "Oak_Park" "Western" "Clark_Lake" "Clinton"
#> [9] "Merchandise_Mart" "Irving_Park" "Washington_Wells" "Harlem"
#> [13] "Monroe" "Polk" "Ashland" "Kedzie"
#> [17] "Addison" "Jefferson_Park" "Montrose" "California"
Let’s put the last two weeks of data into the test set. initial_time_split() can be used for this purpose:
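Here is a minimal sketch of what that split might look like; the exact prop value (everything except the last 14 days) and the chi_split, chi_train, and chi_test names are assumptions:
# keep the last 14 days for testing; rows are assumed to be in chronological order
chi_split <- initial_time_split(Chicago, prop = 1 - (14 / nrow(Chicago)))

chi_train <- training(chi_split)
chi_test  <- testing(chi_split)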
Our Chicago data are ordered in time, so regular cross-validation, which uses random sampling, may not be the best idea.
We can emulate our training/test split by making similar resamples.
Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.
Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.
and so on
Use the date column to define the time periods for resampling (a sketch of a sliding_period() call follows the list below).
Our units will be weeks.
Every analysis set has 15 years of data
Every assessment set has 2 weeks of data
Increment by 2 weeks so that there are no overlapping assessment sets.
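A hedged sketch of a sliding_period() call that matches the scheme above; the exact lookback value is an assumption:
chi_rs <-
  sliding_period(
    chi_train,
    index = date,
    period = "week",
    lookback = 52 * 15,   # about 15 years of analysis data
    assess_stop = 2,      # 2 weeks of assessment data
    step = 2              # advance by 2 weeks so assessment sets do not overlap
  )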
chi_rs
#> # Sliding period resampling
#> # A tibble: 16 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5463/14]> Slice01
#> 2 <split [5467/14]> Slice02
#> 3 <split [5467/14]> Slice03
#> 4 <split [5467/14]> Slice04
#> 5 <split [5467/14]> Slice05
#> 6 <split [5467/14]> Slice06
#> 7 <split [5467/14]> Slice07
#> 8 <split [5467/14]> Slice08
#> 9 <split [5467/14]> Slice09
#> 10 <split [5467/14]> Slice10
#> 11 <split [5467/14]> Slice11
#> 12 <split [5467/14]> Slice12
#> 13 <split [5467/14]> Slice13
#> 14 <split [5467/14]> Slice14
#> 15 <split [5467/14]> Slice15
#> 16 <split [5467/11]> Slice16
We will fit 16 models on 16 slightly different analysis sets.
Each will produce a separate performance metric.
We will average the 16 metrics to get the resampling estimate of that statistic.
Based on the formula, the recipe() function assigns columns to roles of “outcome” or “predictor”.
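A minimal sketch of that initial recipe; the outcome column name (ridership) and the use of the training set are assumptions:
chi_rec <- recipe(ridership ~ ., data = chi_train)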
summary(chi_rec)
#> # A tibble: 50 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 Austin <chr [2]> predictor original
#> 2 Quincy_Wells <chr [2]> predictor original
#> 3 Belmont <chr [2]> predictor original
#> 4 Archer_35th <chr [2]> predictor original
#> 5 Oak_Park <chr [2]> predictor original
#> 6 Western <chr [2]> predictor original
#> 7 Clark_Lake <chr [2]> predictor original
#> 8 Clinton <chr [2]> predictor original
#> 9 Merchandise_Mart <chr [2]> predictor original
#> 10 Irving_Park <chr [2]> predictor original
#> # ℹ 40 more rows
This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.
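A hedged sketch of a step_date() call that would do this; the exact feature set is an assumption:
chi_rec <-
  chi_rec %>%
  # adds day-of-week (a factor), month, and year columns derived from date
  step_date(date, features = c("dow", "month", "year"))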
Add indicators for major holidays. Specific holidays, including non-US ones, can also be generated.
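A hedged sketch using step_holiday(); this particular list of US holidays is an assumption:
chi_rec <-
  chi_rec %>%
  # adds a binary indicator column for each listed holiday
  step_holiday(
    date,
    holidays = c("USNewYearsDay", "USIndependenceDay",
                 "USThanksgivingDay", "USChristmasDay")
  )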
At this point, we don’t need date anymore. Instead of deleting it (there is a step for that), we will change its role to be an identification variable. date is still in the data set, but tidymodels knows not to treat it as an analysis column.
update_role_requirements() is needed to make sure that this column is required when making new data points.
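A hedged sketch of both role changes; the role name "id" is an assumption:
chi_rec <-
  chi_rec %>%
  # keep date in the data but exclude it from modeling
  update_role(date, new_role = "id") %>%
  # require the "id" role to be present when baking new data
  update_role_requirements(role = "id", bake = TRUE)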
The station columns have a very high degree of correlation. We might want to decorrelate them with principal component analysis to help the model fits go more easily. The vector stations contains all of the station names and can be used to identify the relevant columns.
We’ll tune the number of PCA components for (default) values of one to four.
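A hedged sketch of the PCA recipe; selecting the station columns with all_of(!!stations) and the chi_pca_rec name are assumptions:
chi_pca_rec <-
  chi_rec %>%
  # PCA needs the inputs on the same scale
  step_normalize(all_of(!!stations)) %>%
  # replace the correlated station columns with num_comp components (tuned)
  step_pca(all_of(!!stations), num_comp = tune())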
Let’s try three models. The first one requires the rules package (loaded earlier).
# rule-based Cubist ensemble; tune the number of neighbors used for adjustment
cb_spec <- cubist_rules(committees = 25, neighbors = tune())
# multivariate adaptive regression splines; tune the interaction degree
mars_spec <- mars(prod_degree = tune()) %>% set_mode("regression")
# ordinary linear regression
lm_spec <- linear_reg()
chi_set <-
workflow_set(
list(pca = chi_pca_rec, basic = chi_rec),
list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)
) %>%
# Evaluate models using mean absolute errors
option_add(metrics = metric_set(mae))
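A hedged sketch of how the set might be tuned over the resamples; the grid size, seed, and control options are assumptions (the control settings keep the predictions and workflows needed for stacking later):
chi_res <-
  chi_set %>%
  workflow_map(
    fn = "tune_grid",
    resamples = chi_rs,
    grid = 10,
    control = control_grid(save_pred = TRUE, save_workflow = TRUE),
    seed = 12,
    verbose = TRUE
  )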
rank_results(chi_res)
#> # A tibble: 31 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 pca_cubist Preprocessor1_Model1 mae 0.798 0.104 16 recipe cubis… 1
#> 2 pca_cubist Preprocessor3_Model3 mae 0.978 0.110 16 recipe cubis… 2
#> 3 pca_cubist Preprocessor4_Model2 mae 0.983 0.122 16 recipe cubis… 3
#> 4 pca_cubist Preprocessor4_Model1 mae 0.991 0.127 16 recipe cubis… 4
#> 5 pca_cubist Preprocessor3_Model2 mae 0.991 0.113 16 recipe cubis… 5
#> 6 pca_cubist Preprocessor2_Model2 mae 1.02 0.118 16 recipe cubis… 6
#> 7 pca_cubist Preprocessor1_Model3 mae 1.05 0.134 16 recipe cubis… 7
#> 8 basic_cubist Preprocessor1_Model8 mae 1.07 0.115 16 recipe cubis… 8
#> 9 basic_cubist Preprocessor1_Model7 mae 1.07 0.112 16 recipe cubis… 9
#> 10 basic_cubist Preprocessor1_Model6 mae 1.07 0.114 16 recipe cubis… 10
#> # ℹ 21 more rows
We can also pull out the specific tuning results and look at them:
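For example, a hedged sketch that pulls the tuning results for one of the workflow IDs shown above and plots them:
chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  autoplot()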
Model stacks generate predictions that are informed by several models.
Collect all of the resampling results for all model configurations.
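A hedged sketch using the stacks package; the chi_stack name is an assumption:
library(stacks)

chi_stack <-
  stacks() %>%
  add_candidates(chi_res)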
Which configurations should be retained? blend_predictions() uses a penalized linear model to decide:
set.seed(122)
chi_stack_res <- blend_predictions(chi_stack)
chi_stack_res
#> # A tibble: 5 × 3
#> member type weight
#> <chr> <chr> <dbl>
#> 1 pca_cubist_1_1 cubist_rules 0.343
#> 2 pca_cubist_3_2 cubist_rules 0.236
#> 3 basic_cubist_1_4 cubist_rules 0.189
#> 4 pca_lm_4_1 linear_reg 0.163
#> 5 pca_cubist_3_3 cubist_rules 0.109
The overall results of the penalized model:
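A hedged sketch; autoplot() on the blended stack shows how performance changes with the penalty:
autoplot(chi_stack_res)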
For each model retained in the stack, we need its fit on the entire training set.
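A hedged sketch using fit_members() to refit the retained members on the training set:
chi_stack_res <- fit_members(chi_stack_res)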
We can pull out the results and the workflow to fit the single best cubist model.
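A hedged sketch; the workflow ID ("pca_cubist", the top-ranked one above), the metric, and the object names are assumptions:
best_cubist <-
  chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  select_best(metric = "mae")

cubist_res <-
  chi_res %>%
  extract_workflow("pca_cubist") %>%
  finalize_workflow(best_cubist) %>%
  # last_fit() refits on the training set and evaluates on the test set
  last_fit(split = chi_split, metrics = metric_set(mae))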
We don’t have last_fit() for stacks (yet), so we manually make predictions.
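A hedged sketch of those manual predictions; chi_test comes from the earlier split sketch:
stack_pred <-
  predict(chi_stack_res, chi_test) %>%
  bind_cols(chi_test)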
Single best versus the stack:
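A hedged sketch of that comparison; the outcome column name (ridership) is an assumption:
# test set MAE for the single best cubist model
collect_metrics(cubist_res)

# test set MAE for the stacked ensemble
mae(stack_pred, truth = ridership, estimate = .pred)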