Extras - Time-based data splitting

Introduction to tidymodels

The raw taxi data set

We prepared the data set specifically for this introductory workshop.

The raw data looked similar to this:

glimpse(taxi_raw)
#> Rows: 10,000
#> Columns: 24
#> $ trip_id                    <chr> "3ac8d4412642a35e9b9a493285814d7983d5a159",…
#> $ taxi_id                    <chr> "391317d70c5d06deec744062c4595dc1958b200fda…
#> $ trip_start_timestamp       <dttm> 2023-06-10 18:45:00, 2023-05-21 21:30:00, …
#> $ trip_end_timestamp         <dttm> 2023-06-10 19:30:00, 2023-05-21 21:45:00, …
#> $ trip_seconds               <dbl> 3258, 839, 476, 2220, 1588, 2270, 1575, 267…
#> $ trip_miles                 <dbl> 17.02, 2.16, 1.05, 17.40, 17.62, 16.36, 18.…
#> $ pickup_census_tract        <dbl> 17031980000, 17031839100, 17031320100, 1703…
#> $ dropoff_census_tract       <dbl> 17031081403, 17031081300, 17031081403, 1703…
#> $ pickup_community_area      <dbl> 76, 32, 32, 76, 32, 76, 76, 32, 32, 33, 8, …
#> $ dropoff_community_area     <dbl> 8, 8, 8, 32, 76, 8, 32, 76, 28, 32, 7, 33, …
#> $ fare                       <dbl> 44.50, 10.00, 6.75, 45.00, 44.25, 41.25, 44…
#> $ tips                       <dbl> 12.25, 4.00, 2.00, 9.90, 8.00, 9.15, 12.31,…
#> $ tolls                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ extras                     <dbl> 4.0, 1.0, 0.0, 4.0, 2.0, 4.0, 4.0, 0.0, 0.0…
#> $ trip_total                 <dbl> 61.25, 15.50, 9.25, 58.90, 54.75, 54.90, 61…
#> $ payment_type               <chr> "Credit Card", "Credit Card", "Credit Card"…
#> $ company                    <chr> "Taxicab Insurance Agency Llc", "Chicago In…
#> $ pickup_centroid_latitude   <dbl> 41.97907, 41.88099, 41.88499, 41.97907, 41.…
#> $ pickup_centroid_longitude  <dbl> -87.90304, -87.63275, -87.62099, -87.90304,…
#> $ pickup_centroid_location   <chr> "POINT (-87.9030396611 41.9790708201)", "PO…
#> $ dropoff_centroid_latitude  <dbl> 41.89092, 41.89833, 41.89092, 41.87102, 41.…
#> $ dropoff_centroid_longitude <dbl> -87.61887, -87.62076, -87.61887, -87.63141,…
#> $ dropoff_centroid_location  <chr> "POINT (-87.6188683546 41.8909220259)", "PO…
#> $ tip                        <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes…

Time nature of the data

We assumed only the month, day of the week, and hour mattered and treated each observation as independent.

If the data have a strong time component, all your data splitting strategies should support the model in estimating temporal trends.

Thus, don’t sample randomly because this breaks up the time component!
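A minimal sketch of the difference, using toy daily data rather than the taxi data (the `toy` data frame and its columns are made up for illustration). It compares a random split with a time-based split and checks whether training observations come strictly before test observations.

```r
library(rsample)

set.seed(123)

# Toy data (not the taxi data): one observation per day for 100 days
toy <- data.frame(
  day = seq(as.Date("2023-01-01"), by = "day", length.out = 100),
  y   = rnorm(100)
)

random_split <- initial_split(toy, prop = 3 / 4)
time_split   <- initial_time_split(toy, prop = 3 / 4)

# Random split: is any training day later than the earliest test day?
random_overlap <- max(training(random_split)$day) > min(testing(random_split)$day)

# Time split: does every training day precede every test day?
time_ordered <- max(training(time_split)$day) < min(testing(time_split)$day)

random_overlap
time_ordered
```

With a random split, training and test days are interleaved, so the model is assessed on days that fall between the days it was trained on.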

Splitting with time component

The more recent observations are assumed to be more similar to new data, so initial_time_split() puts them into the test set.

The function splits by row position, so it assumes that the data are already ordered by time.

taxi_raw <- taxi_raw %>%
  arrange(trip_start_timestamp)

taxi_split <- initial_time_split(taxi_raw, prop = 3 / 4)
taxi_split
#> <Training/Testing/Total>
#> <7500/2500/10000>

taxi_train <- training(taxi_split)
taxi_test  <- testing(taxi_split)

nrow(taxi_train)
#> [1] 7500
 
nrow(taxi_test)
#> [1] 2500
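To illustrate why the arrange() step matters, here is a small sketch on made-up data (12 consecutive days, with the latest day deliberately moved to the first row). It assumes only that initial_time_split() takes the first `prop` share of rows as they appear.

```r
library(rsample)

# Toy data (not the taxi data): 12 consecutive days, latest day moved to row 1
days     <- as.Date("2023-01-01") + 0:11
shuffled <- data.frame(day = days[c(12, 1:11)])

unsorted_split <- initial_time_split(shuffled, prop = 3 / 4)
sorted_split   <- initial_time_split(
  shuffled[order(shuffled$day), , drop = FALSE],
  prop = 3 / 4
)

# Without sorting, the latest day ends up in the training set
max(training(unsorted_split)$day)
#> [1] "2023-01-12"

# After sorting, the training set stops before the test period
max(training(sorted_split)$day)
#> [1] "2023-01-09"
```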

Time series resampling

The same idea also applies to resampling: the newer observations go into the assessment set.

For example:

  • Fold 1: Take the first X weeks of data as the analysis set, and the next 3 weeks as the assessment set.

  • Fold 2: Take weeks 2 to X + 1 as the analysis set, and the next 3 weeks as the assessment set.

  • and so on
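The fold layout above can be sketched with plain week numbers. The values below (X = 8 analysis weeks, 3 assessment weeks, 3 folds) are hypothetical and chosen only to make the pattern visible.

```r
# Hypothetical values: X = 8 analysis weeks, 3 assessment weeks, first 3 folds
x            <- 8
assess_weeks <- 3

folds <- lapply(1:3, function(fold) {
  analysis_weeks <- fold:(fold + x - 1)
  list(
    analysis   = range(analysis_weeks),                      # first and last analysis week
    assessment = c(fold + x, fold + x + assess_weeks - 1)    # first and last assessment week
  )
})

# Fold 1 uses weeks 1 to 8 for analysis and weeks 9 to 11 for assessment;
# fold 2 shifts both windows forward by one week
str(folds[[1]])
str(folds[[2]])
```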

Rolling origin forecast resampling

Time series resampling

taxi_rs <-
  taxi_train %>%
  sliding_period(
    index = "trip_start_timestamp"
    # the remaining arguments are added step by step below
  )

Use the trip_start_timestamp column as the time index.

Time series resampling

taxi_rs <-
  taxi_train %>%
  sliding_period(
    index = "trip_start_timestamp",
    period = "week"
  )

The windows are measured in weeks.

Time series resampling

taxi_rs <-
  taxi_train %>%
  sliding_period(
    index = "trip_start_timestamp",
    period = "week",
    lookback = 8
  )

Every analysis set covers the current week plus the 8 weeks before it.
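A quick check, on toy data, of how lookback translates into window length. In rsample the current period is always part of the analysis set, so lookback counts the earlier periods on top of it. The toy data start on a Thursday, on the assumption that the default weekly periods are counted from 1970-01-01 (a Thursday), so each toy week is a complete 7-day group.

```r
library(rsample)

# Toy data (not the taxi data): one row per day for 6 weeks, starting on a Thursday
toy <- data.frame(day = as.Date("2023-01-05") + 0:41)

toy_rs <- sliding_period(toy, index = day, period = "week", lookback = 2)

# Number of weeks in the first analysis set (7 rows per week in this toy data):
# the current week plus 2 lookback weeks
nrow(analysis(toy_rs$splits[[1]])) / 7
#> [1] 3
```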

Time series resampling

taxi_rs <-
  taxi_train %>%
  sliding_period(
    index = "trip_start_timestamp",
    period = "week",
    lookback = 8,
    assess_stop = 3
  )

Every assessment set has 3 weeks of data.

Time series resampling

taxi_rs <-
  taxi_train %>%
  sliding_period(
    index = "trip_start_timestamp",
    period = "week",
    lookback = 8,
    assess_stop = 3,
    step = 1
  )

Each fold starts 1 week later than the previous one.

taxi_rs$splits[[1]] %>% assessment() %>% pluck("trip_start_timestamp") %>% range()
#> [1] "2023-03-02 05:15:00 UTC" "2023-03-22 22:00:00 UTC"

taxi_rs$splits[[2]] %>% assessment() %>% pluck("trip_start_timestamp") %>% range()
#> [1] "2023-03-09 07:00:00 UTC" "2023-03-29 21:15:00 UTC"

taxi_rs$splits[[3]] %>% assessment() %>% pluck("trip_start_timestamp") %>% range()
#> [1] "2023-03-16 06:30:00 UTC" "2023-04-05 23:45:00 UTC"
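Rather than checking splits one at a time, the ranges of all folds can be collected into one table. The sketch below uses toy daily data and a hypothetical helper, `fold_ranges()`, which is not part of rsample. The toy data again start on a Thursday so that each week is a complete 7-day group under the default weekly periods.

```r
library(rsample)

# Toy data (not the taxi data): one row per day for 10 weeks, starting on a Thursday
toy <- data.frame(day = as.Date("2023-01-05") + 0:69)

toy_rs <- sliding_period(
  toy,
  index = day,
  period = "week",
  lookback = 3,
  assess_stop = 2
)

# Hypothetical helper: first and last index value of each analysis and
# assessment set, one row per fold
fold_ranges <- function(rset, index) {
  rows <- lapply(rset$splits, function(split) {
    a <- analysis(split)[[index]]
    b <- assessment(split)[[index]]
    data.frame(
      analysis_start   = min(a),
      analysis_end     = max(a),
      assessment_start = min(b),
      assessment_end   = max(b)
    )
  })
  do.call(rbind, rows)
}

ranges <- fold_ranges(toy_rs, "day")
ranges
```

In every row, analysis_end should fall before assessment_start, and both windows should move forward by one week from one fold to the next.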