Machine learning with tidymodels
Run ?stacks::tree_frogs to learn more about this dataset, including references.
N = 572
latency is the numeric outcome
treatment, reflex, and t_o_d are nominal predictors
age is a numeric predictor
tree_frogs
#> # A tibble: 572 × 5
#> treatment reflex age t_o_d latency
#> <chr> <fct> <dbl> <fct> <dbl>
#> 1 control full 5.40 morning 22
#> 2 control low 4.18 night 360
#> 3 control full 4.65 afternoon 106
#> 4 control mid 4.14 night 180
#> 5 control full 4.6 afternoon 60
#> 6 gentamicin full 5.36 morning 39
#> 7 control full 4.56 afternoon 214
#> 8 control full 5.43 morning 50
#> 9 control full 4.63 afternoon 224
#> 10 control full 5.40 morning 63
#> # … with 562 more rows
#> # ℹ Use `print(n = ...)` to see more rows
For machine learning, we typically split data into training and test sets:
Do not 🚫 use the test set during training.
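A minimal sketch of such a split with rsample (attached by tidymodels). It assumes the stacks package is installed for the data. Dropping rows with missing latency is an assumption, since the preprocessing behind the 572-row version is not shown here, so row counts may differ. The seed and the name frog_split are illustrative.

```r
library(tidymodels)  # attaches rsample (initial_split, training, testing) and dplyr

# Assumption: start from stacks::tree_frogs and keep rows with an observed outcome.
frogs <- stacks::tree_frogs %>%
  filter(!is.na(latency))

set.seed(123)  # illustrative seed; makes the random split reproducible

# With no prop argument, initial_split() places 3/4 of the rows in training.
frog_split <- initial_split(frogs)

frog_train <- training(frog_split)  # used to fit models
frog_test  <- testing(frog_split)   # held out until the very end

nrow(frog_train)
nrow(frog_test)
```

Every row lands in exactly one of the two sets, so the two row counts add up to the size of the full data.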
When is a good time to split your data?
frog_train
#> # A tibble: 429 × 5
#> treatment reflex age t_o_d latency
#> <chr> <fct> <dbl> <fct> <dbl>
#> 1 control full 5.36 morning 36
#> 2 gentamicin full 5.37 morning 72
#> 3 gentamicin full 4.65 afternoon 141
#> 4 control full 5.42 morning 27
#> 5 control full 5.43 morning 27
#> 6 gentamicin full 5.38 morning 73
#> 7 gentamicin full 5.42 morning 68
#> 8 gentamicin full 4.75 afternoon 124
#> 9 control full 5.00 night 62
#> 10 control full 5.39 morning 25
#> # … with 419 more rows
#> # ℹ Use `print(n = ...)` to see more rows
frog_test
#> # A tibble: 143 × 5
#> treatment reflex age t_o_d latency
#> <chr> <fct> <dbl> <fct> <dbl>
#> 1 control full 5.40 morning 22
#> 2 control low 4.18 night 360
#> 3 control full 4.63 afternoon 224
#> 4 gentamicin full 4.75 afternoon 158
#> 5 control mid 4.22 night 91
#> 6 gentamicin full 4.89 night 301
#> 7 control full 5.38 morning 2
#> 8 control full 4.80 afternoon 56
#> 9 control full 5.36 morning 11
#> 10 control full 5.40 morning 64
#> # … with 133 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
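One possible way to approach this, sketched under assumptions: the data come from stacks::tree_frogs with missing latency values dropped, and split_with_seed() is a hypothetical helper written only for this illustration.

```r
library(tidymodels)

# Assumption: stacks::tree_frogs restricted to rows with an observed outcome.
frogs <- stacks::tree_frogs %>%
  filter(!is.na(latency))

# Hypothetical helper: split with a given seed, return the training-set mean latency.
split_with_seed <- function(seed) {
  set.seed(seed)
  frog_split <- initial_split(frogs, prop = 0.8)  # 80% training, 20% test
  mean(training(frog_split)$latency)
}

# Different seeds send different rows to training, so the summary shifts a little.
sapply(c(1, 2, 3), split_with_seed)

# The same seed always reproduces the same split.
identical(split_with_seed(1), split_with_seed(1))
```

The prop argument is the share of rows kept for training, so holding out 20% means prop = 0.8.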
Explore the frog_train data on your own!
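A possible starting point for exploring, offered as a sketch and not a prescribed analysis. It rebuilds frog_train from stacks::tree_frogs under the assumption that rows with missing latency are dropped. The seed, the bin count, and the name latency_by_treatment are illustrative.

```r
library(tidymodels)  # attaches ggplot2 and dplyr along with rsample

# Assumption: stacks::tree_frogs restricted to rows with an observed outcome.
frogs <- stacks::tree_frogs %>%
  filter(!is.na(latency))

set.seed(123)
frog_train <- training(initial_split(frogs))

# Distribution of the outcome in the training set
ggplot(frog_train, aes(x = latency)) +
  geom_histogram(bins = 20)

# Outcome summarized by one of the nominal predictors
latency_by_treatment <- frog_train %>%
  group_by(treatment) %>%
  summarize(n = n(), median_latency = median(latency))

latency_by_treatment
```

Only frog_train is touched here, which keeps the test set out of any decisions made while exploring.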
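Stratified sampling splits the data within each quartile of the outcome, so the training and test sets have similar latency distributions.
Use strata = latency in initial_split()
Stratification often helps, with very little downside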
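A minimal sketch of a stratified split, assuming the data come from stacks::tree_frogs with missing latency values dropped. The seed and the 80/20 proportion are illustrative.

```r
library(tidymodels)

# Assumption: stacks::tree_frogs restricted to rows with an observed outcome.
frogs <- stacks::tree_frogs %>%
  filter(!is.na(latency))

set.seed(123)

# For a numeric strata variable, rsample bins the values (breaks = 4 by default,
# i.e. quartiles) and samples within each bin.
frog_split <- initial_split(frogs, prop = 0.8, strata = latency)

frog_train <- training(frog_split)
frog_test  <- testing(frog_split)

# Compare the outcome distribution in the two sets.
quantile(frog_train$latency)
quantile(frog_test$latency)
```

Comparing the two quantile() results is a quick check that neither set ended up with a disproportionate share of very short or very long latencies.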