2 - Your data budget

Introduction to tidymodels

Data on Chicago taxi trips

  • The city of Chicago releases anonymized trip-level data on taxi trips in the city.
  • We pulled a sample of 10,000 rides occurring in early 2022.
  • Type ?taxi to learn more about this dataset, including references.

Data on Chicago taxi trips

  • N = 10,000
  • A nominal outcome, tip, with levels "yes" and "no"
  • Several nominal variables like pickup & dropoff location, taxi ID, and payment type.
  • Several numeric variables like trip length and fare subtotals.

Checklist for predictors

  • Is it ethical to use this variable? (Or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability?

Data on Chicago taxi trips

library(tidymodels)

taxi
#> # A tibble: 10,000 × 7
#>    tip   distance company                      local dow   month  hour
#>    <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
#>  1 yes      17.2  Chicago Independents         no    Thu   Feb      16
#>  2 yes       0.88 City Service                 yes   Thu   Mar       8
#>  3 yes      18.1  other                        no    Mon   Feb      18
#>  4 yes      20.7  Chicago Independents         no    Mon   Apr       8
#>  5 yes      12.2  Chicago Independents         no    Sun   Mar      21
#>  6 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
#>  7 yes      17.5  Flash Cab                    no    Fri   Mar      12
#>  8 yes      17.7  other                        no    Sun   Jan       6
#>  9 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
#> 10 yes       1.47 City Service                 no    Tue   Mar      14
#> # ℹ 9,990 more rows

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.
  • Spending too much data in testing prevents us from computing a good estimate of model parameters.

Your turn

When is a good time to split your data?


The testing data is precious 💎

The initial split

set.seed(123)
taxi_split <- initial_split(taxi)
taxi_split
#> <Training/Testing/Total>
#> <7500/2500/10000>

What is set.seed()?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
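
A minimal sketch of that determinism, using only base R: resetting the same seed before a random draw reproduces the draw exactly. The names first_draw and second_draw are illustrative and not part of the workshop code; initial_split() relies on the same random number generator, so the same reasoning applies to the split.

```r
# Reset the generator to the same state before each draw
set.seed(123)
first_draw <- sample(1:10000, size = 5)

set.seed(123)
second_draw <- sample(1:10000, size = 5)

# Same seed, same sequence of "random" numbers
identical(first_draw, second_draw)
#> [1] TRUE
```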

Accessing the data

taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

The training set

taxi_train
#> # A tibble: 7,500 × 7
#>    tip   distance company                   local dow   month  hour
#>    <fct>    <dbl> <fct>                     <fct> <fct> <fct> <int>
#>  1 yes       0.7  Taxi Affiliation Services yes   Tue   Mar      18
#>  2 yes       0.99 Sun Taxi                  yes   Tue   Jan       8
#>  3 yes       1.78 other                     no    Sat   Mar      22
#>  4 yes       0    Taxi Affiliation Services yes   Wed   Apr      15
#>  5 yes       0    Taxi Affiliation Services no    Sun   Jan      21
#>  6 yes       2.3  other                     no    Sat   Apr      21
#>  7 yes       6.35 Sun Taxi                  no    Wed   Mar      16
#>  8 yes       2.79 other                     no    Sun   Feb      14
#>  9 yes      16.6  other                     no    Sun   Apr      18
#> 10 yes       0.02 Chicago Independents      yes   Sun   Apr      15
#> # ℹ 7,490 more rows

The test set

🙈

There are 2500 rows and 7 columns in the test set.

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.


Data splitting and spending

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8)
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

nrow(taxi_train)
#> [1] 8000
nrow(taxi_test)
#> [1] 2000

What about a validation set?

Validation set

set.seed(123)
initial_validation_split(taxi, prop = c(0.6, 0.2))
#> <Training/Validation/Testing/Total>
#> <6000/2000/2000/10000>
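
As a sketch of how the three partitions might be pulled out of that object: rsample provides validation() alongside training() and testing(). The object names below (taxi_val_split, taxi_fit_data, taxi_validation, taxi_holdout) are hypothetical and chosen so they do not overwrite taxi_train and taxi_test.

```r
library(tidymodels)

set.seed(123)
taxi_val_split <- initial_validation_split(taxi, prop = c(0.6, 0.2))

# One accessor per partition
taxi_fit_data   <- training(taxi_val_split)
taxi_validation <- validation(taxi_val_split)
taxi_holdout    <- testing(taxi_val_split)

# Row counts should match the <6000/2000/2000/10000> printout above
c(
  training   = nrow(taxi_fit_data),
  validation = nrow(taxi_validation),
  testing    = nrow(taxi_holdout)
)
```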

Exploratory data analysis for ML 🧐

Your turn

Explore the taxi_train data on your own!

  • What’s the distribution of the outcome, tip?
  • What’s the distribution of numeric variables like distance?
  • How does tip differ across the categorical variables?

taxi_train %>% 
  ggplot(aes(x = tip)) +
  geom_bar()

taxi_train %>% 
  ggplot(aes(x = tip, fill = local)) +
  geom_bar() +
  scale_fill_viridis_d(end = .5)

taxi_train %>% 
  ggplot(aes(x = hour, fill = tip)) +
  geom_bar()

taxi_train %>% 
  ggplot(aes(x = hour, fill = tip)) +
  geom_bar(position = "fill")

taxi_train %>% 
  ggplot(aes(x = distance)) +
  geom_histogram(bins = 100) +
  facet_grid(vars(tip))
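
A numeric companion to the bar chart of tip can be sketched with dplyr, which is attached by tidymodels. The split is recreated here so the block runs on its own; tip_counts and the prop column are illustrative names. No output is shown because the exact counts depend on the split.

```r
library(tidymodels)

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8)
taxi_train <- training(taxi_split)

# Count each level of the outcome and convert to a proportion
tip_counts <- taxi_train %>%
  count(tip) %>%
  mutate(prop = n / sum(n))

tip_counts
```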

Split smarter

Stratified sampling splits within each level of the response, so the training and test sets keep similar outcome proportions.

Stratification

Use strata = tip

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split
#> <Training/Testing/Total>
#> <8000/2000/10000>

Stratification

Stratification often helps, with very little downside
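
One way to check what stratification buys, sketched under the assumption that the split above is used as-is: compare the share of tip == "yes" in each partition. The names train_yes and test_yes are illustrative. With strata = tip the two shares should be nearly identical; exact values are not shown here.

```r
library(tidymodels)

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)

# Proportion of rides with a tip in each partition
train_yes <- mean(training(taxi_split)$tip == "yes")
test_yes  <- mean(testing(taxi_split)$tip == "yes")

c(train = train_yes, test = test_yes)
```

Repeating the comparison without strata = tip gives a sense of how much the proportions can drift by chance alone.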

The whole game - status update