2 - Your data budget

Introduction to tidymodels

Data on Chicago taxi trips

The city of Chicago releases anonymized trip-level data on taxi trips in the city.
We pulled a sample of 10,000 rides occurring in early 2022.
Type ?taxi to learn more about this dataset, including references.

Data on Chicago taxi trips

N = 10,000
A nominal outcome, tip, with levels "yes" and "no"
Several nominal variables like pickup & dropoff location, taxi ID, and payment type.
Several numeric variables like trip length and fare subtotals.

Checklist for predictors

Is it ethical to use this variable? (Or even legal?)
Will this variable be available at prediction time?
Does this variable contribute to explainability?

Data on Chicago taxi trips

library(tidymodels)

taxi
#> # A tibble: 10,000 × 7
#>    tip   distance company                      local dow   month  hour
#>    <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
#>  1 yes      17.2  Chicago Independents         no    Thu   Feb      16
#>  2 yes       0.88 City Service                 yes   Thu   Mar       8
#>  3 yes      18.1  other                        no    Mon   Feb      18
#>  4 yes      20.7  Chicago Independents         no    Mon   Apr       8
#>  5 yes      12.2  Chicago Independents         no    Sun   Mar      21
#>  6 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
#>  7 yes      17.5  Flash Cab                    no    Fri   Mar      12
#>  8 yes      17.7  other                        no    Sun   Jan       6
#>  9 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
#> 10 yes       1.47 City Service                 no    Tue   Mar      14
#> # ℹ 9,990 more rows

Data splitting and spending

For machine learning, we typically split data into training and test sets:

The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

Spending too much data in training prevents us from computing a good assessment of predictive performance.

Spending too much data in testing prevents us from computing a good estimate of model parameters.

Your turn

When is a good time to split your data?

03:00

The testing data is precious 💎

The initial split

set.seed(123)
taxi_split <- initial_split(taxi)
taxi_split
#> <Training/Testing/Total>
#> <7500/2500/10000>

What is `set.seed()`?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
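
A minimal sketch of what this reproducibility looks like, using base R's sample() rather than the taxi data. The seed values 123 and 456 and the object names are arbitrary choices for illustration.

```r
# Same seed, same "random" draws
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE: the generator was reset to the same state

# A different seed starts the generator from a different state,
# so the draws will generally differ
set.seed(456)
third_draw <- sample(1:100, size = 5)
```

The same logic applies to initial_split(): calling set.seed() with the same value right before the split should give the same training and test rows each time.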

Accessing the data

taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
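
As a quick sanity check, the two pieces should account for every row of the original data. This is only a sketch: it repeats the split from the earlier slide so it can run on its own, and it assumes the default proportion of 3/4 that produced the 7500/2500 split shown there.

```r
library(tidymodels)

set.seed(123)
taxi_split <- initial_split(taxi)
taxi_train <- training(taxi_split)
taxi_test  <- testing(taxi_split)

# Every row lands in exactly one of the two sets
nrow(taxi_train) + nrow(taxi_test) == nrow(taxi)

# Share of rows spent on training (default prop is 3/4)
nrow(taxi_train) / nrow(taxi)
```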

The training set

taxi_train
#> # A tibble: 7,500 × 7
#>    tip   distance company                   local dow   month  hour
#>    <fct>    <dbl> <fct>                     <fct> <fct> <fct> <int>
#>  1 yes       0.7  Taxi Affiliation Services yes   Tue   Mar      18
#>  2 yes       0.99 Sun Taxi                  yes   Tue   Jan       8
#>  3 yes       1.78 other                     no    Sat   Mar      22
#>  4 yes       0    Taxi Affiliation Services yes   Wed   Apr      15
#>  5 yes       0    Taxi Affiliation Services no    Sun   Jan      21
#>  6 yes       2.3  other                     no    Sat   Apr      21
#>  7 yes       6.35 Sun Taxi                  no    Wed   Mar      16
#>  8 yes       2.79 other                     no    Sun   Feb      14
#>  9 yes      16.6  other                     no    Sun   Apr      18
#> 10 yes       0.02 Chicago Independents      yes   Sun   Apr      15
#> # ℹ 7,490 more rows

The test set

🙈

There are 2500 rows and 7 columns in the test set.

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

05:00

Data splitting and spending

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8)
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

nrow(taxi_train)
#> [1] 8000
nrow(taxi_test)
#> [1] 2000

What about a validation set?

Validation set

set.seed(123)
initial_validation_split(taxi, prop = c(0.6, 0.2))
#> <Training/Validation/Testing/Total>
#> <6000/2000/2000/10000>
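
A sketch of how the three pieces might be pulled out of a three-way split, assuming a version of rsample recent enough to provide initial_validation_split() and its validation() accessor. The object names train_60, val_20, and test_20 are made up for illustration so they do not overwrite taxi_train and taxi_test.

```r
library(tidymodels)

set.seed(123)
taxi_val_split <- initial_validation_split(taxi, prop = c(0.6, 0.2))

# One accessor per piece of the split
train_60 <- training(taxi_val_split)
val_20   <- validation(taxi_val_split)
test_20  <- testing(taxi_val_split)

# Row counts for each piece
c(train = nrow(train_60), validation = nrow(val_20), test = nrow(test_20))
```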

Exploratory data analysis for ML 🧐

Your turn

Explore the taxi_train data on your own!

What’s the distribution of the outcome, tip?
What’s the distribution of numeric variables like distance?
How does tip differ across the categorical variables?

08:00
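
One possible way to start on these questions numerically before plotting. This is a sketch, not the only approach: it repeats the 80/20 split so it runs on its own, and the result names (tip_counts, distance_summary, tip_by_local) are hypothetical.

```r
library(tidymodels)

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8)
taxi_train <- training(taxi_split)

# Distribution of the outcome: counts and proportions per level of tip
tip_counts <- taxi_train %>%
  count(tip) %>%
  mutate(prop = n / sum(n))
tip_counts

# Numeric summary of distance
distance_summary <- taxi_train %>%
  summarize(
    min    = min(distance),
    median = median(distance),
    mean   = mean(distance),
    max    = max(distance)
  )
distance_summary

# Tip rate within each level of a categorical predictor
tip_by_local <- taxi_train %>%
  group_by(local) %>%
  summarize(prop_tip = mean(tip == "yes"), n = n())
tip_by_local
```

Only the training set is explored here; the test set stays untouched.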

taxi_train %>% 
  ggplot(aes(x = tip)) +
  geom_bar()

taxi_train %>% 
  ggplot(aes(x = tip, fill = local)) +
  geom_bar() +
  scale_fill_viridis_d(end = .5)

taxi_train %>% 
  ggplot(aes(x = hour, fill = tip)) +
  geom_bar()

taxi_train %>% 
  ggplot(aes(x = hour, fill = tip)) +
  geom_bar(position = "fill")

taxi_train %>% 
  ggplot(aes(x = distance)) +
  geom_histogram(bins = 100) +
  facet_grid(vars(tip))

Split smarter

Stratified sampling splits the data within each level of the response, so the training and test sets keep similar proportions of each tip level.

Stratification

Use strata = tip

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split
#> <Training/Testing/Total>
#> <8000/2000/10000>
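
To see what stratification buys, the proportion of each tip level can be compared across the two sets. A minimal sketch, with train_props and test_props as hypothetical names; the proportions should come out nearly identical, though the exact values are not shown here.

```r
library(tidymodels)

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)

# Proportion of each tip level in the training set
train_props <- training(taxi_split) %>%
  count(tip) %>%
  mutate(prop = n / sum(n))

# Proportion of each tip level in the test set
test_props <- testing(taxi_split) %>%
  count(tip) %>%
  mutate(prop = n / sum(n))

train_props
test_props
```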

Stratification

Stratification often helps, with very little downside
