Machine learning with tidymodels

Several years worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (“L”) train station in Chicago.

More information:

Several Chapters in

*Feature Engineering and Selection*.- Start with Section 4.1
- See Section 1.3

- the 14-day lagged ridership at this and other stations (units: thousands of rides/day)
- weather data
- home/away game schedules for Chicago teams
- the date

The data are in `modeldata`

. See `?Chicago`

.

Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.

```
library(tidymodels)
library(rules)
data("Chicago")
dim(Chicago)
#> [1] 5698 50
stations
#> [1] "Austin" "Quincy_Wells" "Belmont" "Archer_35th"
#> [5] "Oak_Park" "Western" "Clark_Lake" "Clinton"
#> [9] "Merchandise_Mart" "Irving_Park" "Washington_Wells" "Harlem"
#> [13] "Monroe" "Polk" "Ashland" "Kedzie"
#> [17] "Addison" "Jefferson_Park" "Montrose" "California"
```

`05:00`

Let’s put the last two weeks of data into the test set. `initial_time_split()`

can be used for this purpose:

Our Chicago data is over time. Regular cross-validation, which uses random sampling, may not be the best idea.

We can emulate our training/test split by making similar resamples.

Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.

Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.

and so on

Use the `date`

column to find the date data.

Our units will be weeks.

Every analysis set has 15 years of data

Every assessment set has 2 weeks of data

Increment by 2 weeks so that there are no overlapping assessment sets.

```
chi_rs
#> # Sliding period resampling
#> # A tibble: 16 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5463/14]> Slice01
#> 2 <split [5467/14]> Slice02
#> 3 <split [5467/14]> Slice03
#> 4 <split [5467/14]> Slice04
#> 5 <split [5467/14]> Slice05
#> 6 <split [5467/14]> Slice06
#> 7 <split [5467/14]> Slice07
#> 8 <split [5467/14]> Slice08
#> 9 <split [5467/14]> Slice09
#> 10 <split [5467/14]> Slice10
#> 11 <split [5467/14]> Slice11
#> 12 <split [5467/14]> Slice12
#> 13 <split [5467/14]> Slice13
#> 14 <split [5467/14]> Slice14
#> 15 <split [5467/14]> Slice15
#> 16 <split [5467/11]> Slice16
```

We will fit 16 models on 16 slightly different analysis sets.

Each will produce a separate performance metrics.

We will average the 16 metrics to get the resampling estimate of that statistic.

Based on the formula, the function assigns columns to roles of “outcome” or “predictor”

```
summary(chi_rec)
#> # A tibble: 50 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Austin numeric predictor original
#> 2 Quincy_Wells numeric predictor original
#> 3 Belmont numeric predictor original
#> 4 Archer_35th numeric predictor original
#> 5 Oak_Park numeric predictor original
#> 6 Western numeric predictor original
#> 7 Clark_Lake numeric predictor original
#> 8 Clinton numeric predictor original
#> 9 Merchandise_Mart numeric predictor original
#> 10 Irving_Park numeric predictor original
#> # … with 40 more rows
```

This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.

Add indicators for major holidays. Specific holidays, especially those non-USA, can also be generated.

At this point, we don’t need `date`

anymore. Instead of deleting it (there is a step for that) we will change its *role* to be an identification variable.

`date`

is still in the data set but tidymodels knows not to treat it as an analysis column.

`update_role_requirements()`

is needed to make sure that this column is required when making new data points.

The station columns have a very high degree of correlation.

We might want to decorrelated them with principle component analysis to help the model fits go more easily.

The vector `stations`

contains all station names and can be used to identify all the relevant columns.

We’ll tune the number of PCA components for (default) values of one to four.

Let’s try three models. The first one requires the `rules`

package (loaded earlier).

```
cb_spec <- cubist_rules(committees = 25, neighbors = tune())
mars_spec <- mars(prod_degree = tune()) %>% set_mode("regression")
lm_spec <- linear_reg()
chi_set <-
workflow_set(
list(pca = chi_pca_rec, basic = chi_rec),
list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)
) %>%
# Evaluate models using mean absolute errors
option_add(metrics = metric_set(mae))
```

```
rank_results(chi_res)
#> # A tibble: 31 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 pca_cubist Preprocessor1_Model1 mae 0.798 0.104 16 recipe cubis… 1
#> 2 pca_cubist Preprocessor3_Model3 mae 0.978 0.110 16 recipe cubis… 2
#> 3 pca_cubist Preprocessor4_Model2 mae 0.983 0.122 16 recipe cubis… 3
#> 4 pca_cubist Preprocessor4_Model1 mae 0.991 0.127 16 recipe cubis… 4
#> 5 pca_cubist Preprocessor3_Model2 mae 0.991 0.113 16 recipe cubis… 5
#> 6 pca_cubist Preprocessor2_Model2 mae 1.02 0.118 16 recipe cubis… 6
#> 7 pca_cubist Preprocessor1_Model3 mae 1.05 0.134 16 recipe cubis… 7
#> 8 basic_cubist Preprocessor1_Model8 mae 1.07 0.115 16 recipe cubis… 8
#> 9 basic_cubist Preprocessor1_Model7 mae 1.07 0.112 16 recipe cubis… 9
#> 10 basic_cubist Preprocessor1_Model6 mae 1.07 0.114 16 recipe cubis… 10
#> # … with 21 more rows
```

We can also pull out the specific tuning results and look at them:

`final_fit`

? *Model stacks* generate predictions that are informed by several models.

`final_fit`

? `final_fit`

? `final_fit`

? `final_fit`

? `final_fit`

? - Define candidate members
- Initialize a data stack object
- Add candidate ensemble members to the data stack
- Evaluate how to combine their predictions
- Fit candidate ensemble members with non-zero stacking coefficients
- Predict on new data!

Collect all of the resampling results for all model configurations.

Which configurations should be retained? Uses a penalized linear model:

```
set.seed(122)
chi_stack_res <- blend_predictions(chi_stack)
chi_stack_res
#> # A tibble: 5 × 3
#> member type weight
#> <chr> <chr> <dbl>
#> 1 pca_cubist_1_1 cubist_rules 0.343
#> 2 pca_cubist_3_2 cubist_rules 0.236
#> 3 basic_cubist_1_4 cubist_rules 0.189
#> 4 pca_lm_4_1 linear_reg 0.163
#> 5 pca_cubist_3_3 cubist_rules 0.109
```

The overall results of the penalized model:

For each model we retain in the stack, we need their model fit on the entire training set.

We can pull out the results and the workflow to fit the single best cubist model.

We don’t have `last_fit()`

for stacks (yet) so we manually make predictions.

Single best versus the stack: