Machine learning with tidymodels
Run ?modeldatatoo::data_taxi() to learn more about this dataset, including references.
Is it ethical to use this variable? (Or even legal?)
Will this variable be available at prediction time?
Does this variable contribute to explainability?
We are using a slightly modified version from the modeldatatoo data.
N = 10,000
tip is the nominal outcome, with levels "yes" and "no"
company, local, dow, and month are nominal predictors
distance and hour are numeric predictors
taxi
#> # A tibble: 8,807 × 7
#>    tip   distance company      local dow   month  hour
#>    <fct>    <dbl> <fct>        <fct> <fct> <fct> <int>
#>  1 yes       1.24 Sun Taxi     no    Thu   Feb      13
#>  2 no        5.39 Flash Cab    no    Sat   Mar      12
#>  3 yes       3.01 City Service no    Wed   Feb      17
#>  4 no       18.4  Sun Taxi     no    Sat   Apr       6
#>  5 yes       1.76 Sun Taxi     no    Sun   Jan      15
#>  6 yes      13.6  Sun Taxi     no    Mon   Feb      17
#>  7 yes       3.71 City Service no    Mon   Mar      21
#>  8 yes       4.8  other        no    Tue   Mar       9
#>  9 yes      18.0  City Service no    Fri   Jan      19
#> 10 no       17.5  other        yes   Thu   Apr      12
#> # ℹ 8,797 more rows
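For machine learning, we typically split data into training and test sets:
The training set is used to estimate model parameters.
The test set is used for an independent assessment of model performance.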
Do not 🚫 use the test set during training.
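As a minimal sketch of such a split, the code below uses rsample::initial_split() on a small simulated data frame. The name taxi_sim and all of its values are hypothetical stand-ins, not the real taxi data. With no prop argument, initial_split() places 3/4 of the rows in the training set.

```r
library(rsample)

# Simulated stand-in for the taxi data (hypothetical values, not the real data)
set.seed(123)
n <- 1000
taxi_sim <- data.frame(
  tip = factor(
    sample(c("yes", "no"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("yes", "no")
  ),
  distance = round(runif(n, 0.5, 20), 2),
  hour = sample(0:23, n, replace = TRUE)
)

# Split once, up front; the default proportion for training is 3/4
taxi_split <- initial_split(taxi_sim)

# Extract the two non-overlapping pieces
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

nrow(taxi_train)
nrow(taxi_test)
```

Every row ends up in exactly one of the two sets, so the row counts add up to the size of the original data.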
When is a good time to split your data?
taxi_train
#> # A tibble: 6,605 × 7
#>    tip   distance company                   local dow   month  hour
#>    <fct>    <dbl> <fct>                     <fct> <fct> <fct> <int>
#>  1 yes       4.54 City Service              no    Sat   Mar      16
#>  2 no       10.2  Flash Cab                 no    Mon   Feb       8
#>  3 yes      12.4  other                     no    Sun   Apr      15
#>  4 yes      15.3  Sun Taxi                  no    Mon   Apr      18
#>  5 no        6.41 Flash Cab                 no    Wed   Apr      14
#>  6 yes       1.56 other                     no    Tue   Jan      13
#>  7 yes       3.13 Flash Cab                 no    Sun   Apr      12
#>  8 yes       7.54 other                     no    Tue   Apr       8
#>  9 yes       6.98 Flash Cab                 no    Tue   Apr       5
#> 10 yes       0.7  Taxi Affiliation Services no    Tue   Jan       9
#> # ℹ 6,595 more rows
🙈
There are 2,202 rows and 7 columns in the test set.
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
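One possible approach to this exercise, sketched on simulated data: pass prop = 0.8 to rsample::initial_split() so that 20% of rows are held out, and wrap the call in a small helper to compare seeds. The names taxi_sim and split_with_seed() are hypothetical and introduced only for illustration.

```r
library(rsample)

# Simulated stand-in for the taxi data (hypothetical values, not the real data)
set.seed(123)
n <- 1000
taxi_sim <- data.frame(
  tip = factor(
    sample(c("yes", "no"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("yes", "no")
  ),
  distance = round(runif(n, 0.5, 20), 2)
)

# Hypothetical helper: fix the seed, then keep 80% of rows for training
split_with_seed <- function(seed) {
  set.seed(seed)
  initial_split(taxi_sim, prop = 0.8)
}

split_a <- split_with_seed(1)
split_b <- split_with_seed(2)

# The set sizes do not depend on the seed
nrow(training(split_a))
nrow(testing(split_a))

# Which rows land in each set does, so summaries shift slightly
mean(training(split_a)$distance)
mean(training(split_b)$distance)
```

Setting the seed immediately before the split is what makes a given partition reproducible.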
Explore the taxi_train data on your own!
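A possible starting point for this exploration, sketched with dplyr on simulated data. The data frame taxi_train_sim and its values are hypothetical stand-ins for taxi_train; the same calls should apply to the real training set, since it has tip and distance columns.

```r
library(dplyr)

# Simulated stand-in for taxi_train (hypothetical values, not the real data)
set.seed(123)
n <- 500
taxi_train_sim <- data.frame(
  tip = factor(
    sample(c("yes", "no"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("yes", "no")
  ),
  distance = round(runif(n, 0.5, 20), 2),
  hour = sample(0:23, n, replace = TRUE)
)

# How balanced is the outcome?
tip_counts <- taxi_train_sim |>
  count(tip) |>
  mutate(prop = n / sum(n))

# Does trip distance differ between tipped and untipped rides?
distance_by_tip <- taxi_train_sim |>
  group_by(tip) |>
  summarise(mean_distance = mean(distance), n_trips = n())

tip_counts
distance_by_tip
```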
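Stratified sampling splits the data within each level of the response, so the training and test sets keep similar proportions of each tip value
Use strata = tip in initial_split()
Stratification often helps, with very little downside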
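A minimal sketch of a stratified split, again on simulated data. The name taxi_sim and the 70/30 class balance are assumptions chosen for illustration and do not reflect the real taxi data. The strata argument of rsample::initial_split() takes the column to stratify on.

```r
library(rsample)

# Simulated stand-in for the taxi data (hypothetical values, not the real data)
set.seed(123)
n <- 1000
taxi_sim <- data.frame(
  tip = factor(
    sample(c("yes", "no"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("yes", "no")
  ),
  distance = round(runif(n, 0.5, 20), 2)
)

# Sample 80% for training separately within each level of tip
set.seed(456)
taxi_split <- initial_split(taxi_sim, prop = 0.8, strata = tip)

# Class proportions in each set
train_props <- prop.table(table(training(taxi_split)$tip))
test_props <- prop.table(table(testing(taxi_split)$tip))

train_props
test_props
```

Because sampling happens within each level of tip, the two sets of proportions should be close to each other and to the proportions in the full data.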