The city of Chicago releases anonymized trip-level data on taxi trips in the city.
We pulled a sample of 10,000 rides occurring in early 2022.
Type ?taxi to learn more about this dataset, including references.
Data on Chicago taxi trips
N = 10,000
A nominal outcome, tip, with levels "yes" and "no".
Several nominal variables like pickup and dropoff location, taxi ID, and payment type.
Several numeric variables like trip length and fare subtotals.
Checklist for predictors
Is it ethical to use this variable? (Or even legal?)
Will this variable be available at prediction time?
Does this variable contribute to explainability?
Data on Chicago taxi trips
library(tidymodels)

taxi
#> # A tibble: 10,000 × 7
#>    tip   distance company                      local dow   month  hour
#>    <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
#>  1 yes      17.2  Chicago Independents         no    Thu   Feb      16
#>  2 yes       0.88 City Service                 yes   Thu   Mar       8
#>  3 yes      18.1  other                        no    Mon   Feb      18
#>  4 yes      20.7  Chicago Independents         no    Mon   Apr       8
#>  5 yes      12.2  Chicago Independents         no    Sun   Mar      21
#>  6 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
#>  7 yes      17.5  Flash Cab                    no    Fri   Mar      12
#>  8 yes      17.7  other                        no    Sun   Jan       6
#>  9 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
#> 10 yes       1.47 City Service                 no    Tue   Mar      14
#> # ℹ 9,990 more rows
Data splitting and spending
For machine learning, we typically split data into training and test sets:
The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.
Do not 🚫 use the test set during training.
The more data we spend 🤑, the better estimates we’ll get.
Spending too much data in training prevents us from computing a good assessment of predictive performance.
Spending too much data in testing prevents us from computing a good estimate of model parameters.
To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.
This allows us to reproduce results by setting that seed.
Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
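The split itself can be sketched as follows. This is a minimal example assuming the rsample functions initial_split(), training(), and testing() that library(tidymodels) attaches; the seed value 123 and the object names taxi_split, taxi_test, and taxi_split_again are illustrative choices, so the rows you get may differ from any output printed in this section.

```r
library(tidymodels)  # attaches rsample (splitting) and modeldata (the taxi data)

# The seed value 123 is an arbitrary choice for illustration.
set.seed(123)
taxi_split <- initial_split(taxi)  # default: 75% training, 25% testing

taxi_train <- training(taxi_split)
taxi_test  <- testing(taxi_split)

# Resetting the same seed before splitting again reproduces the same split.
set.seed(123)
taxi_split_again <- initial_split(taxi)
identical(training(taxi_split_again), taxi_train)
```

Setting the seed immediately before initial_split() keeps the split reproducible even if other code that draws random numbers is added earlier in the script.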
taxi_train
#> # A tibble: 7,500 × 7
#>    tip   distance company                   local dow   month  hour
#>    <fct>    <dbl> <fct>                     <fct> <fct> <fct> <int>
#>  1 yes       0.7  Taxi Affiliation Services yes   Tue   Mar      18
#>  2 yes       0.99 Sun Taxi                  yes   Tue   Jan       8
#>  3 yes       1.78 other                     no    Sat   Mar      22
#>  4 yes       0    Taxi Affiliation Services yes   Wed   Apr      15
#>  5 yes       0    Taxi Affiliation Services no    Sun   Jan      21
#>  6 yes       2.3  other                     no    Sat   Apr      21
#>  7 yes       6.35 Sun Taxi                  no    Wed   Mar      16
#>  8 yes       2.79 other                     no    Sun   Feb      14
#>  9 yes      16.6  other                     no    Sun   Apr      18
#> 10 yes       0.02 Chicago Independents      yes   Sun   Apr      15
#> # ℹ 7,490 more rows
The test set
🙈
There are 2,500 rows and 7 columns in the test set.
Your turn
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
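One possible way to approach this exercise is sketched here. It assumes the prop argument of rsample's initial_split(), which sets the proportion kept for training; the seed values 1 and 2 and the names split_a and split_b are arbitrary, hypothetical choices.

```r
library(tidymodels)

# Hold out 20% for testing by keeping 80% for training.
set.seed(1)  # arbitrary seed
split_a <- initial_split(taxi, prop = 0.8)

set.seed(2)  # a different arbitrary seed
split_b <- initial_split(taxi, prop = 0.8)

# Both splits have the same sizes ...
nrow(training(split_a))
nrow(testing(split_a))

# ... but different rows end up in each training set.
identical(training(split_a), training(split_b))

# The share of rides with a tip in the training set may shift between seeds.
mean(training(split_a)$tip == "yes")
mean(training(split_b)$tip == "yes")
```

Comparing a simple summary such as the tip rate across seeds shows how much the split alone can move a statistic, which is why picking the seed that looks best is discouraged.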