2 - Your data budget

Machine learning with tidymodels

Data on tree frog hatching

Red-eyed tree frog embryos can hatch earlier than their normal ~7 days if they detect a potential predator threat!
Type ?stacks::tree_frogs to learn more about this dataset, including references.
We are using a slightly modified version from stacks.

library(tidymodels)

data("tree_frogs", package = "stacks")
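tree_frogs <- tree_frogs %>%
  mutate(t_o_d = factor(t_o_d),     # time of day as a factor
         age = age / 86400) %>%     # convert age from seconds to days
  filter(!is.na(latency)) %>%       # keep rows with a recorded latency
  select(-c(clutch, hatched))       # drop columns not used here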

Data on tree frog hatching

N = 572
A numeric outcome, latency
4 other variables
- treatment, reflex, and t_o_d are nominal predictors
- age is a numeric predictor

Data on tree frog hatching

tree_frogs
#> # A tibble: 572 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.40 morning        22
#>  2 control    low     4.18 night         360
#>  3 control    full    4.65 afternoon     106
#>  4 control    mid     4.14 night         180
#>  5 control    full    4.6  afternoon      60
#>  6 gentamicin full    5.36 morning        39
#>  7 control    full    4.56 afternoon     214
#>  8 control    full    5.43 morning        50
#>  9 control    full    4.63 afternoon     224
#> 10 control    full    5.40 morning        63
#> # … with 562 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Data splitting and spending

For machine learning, we typically split data into training and test sets:

The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.
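
One way to read this rule: anything that is estimated from data, including preprocessing statistics, should come from the training set alone and then be applied to the test set unchanged. The sketch below uses base R and hypothetical data frames (toy_train and toy_test are stand-ins, not the frog data) to center and scale a predictor that way.

```r
set.seed(123)

# Hypothetical stand-ins for a training and a test set
toy_train <- data.frame(age = rnorm(80, mean = 5, sd = 0.4))
toy_test  <- data.frame(age = rnorm(20, mean = 5, sd = 0.4))

# Estimate the centering and scaling values from the training set only
age_mean <- mean(toy_train$age)
age_sd   <- sd(toy_train$age)

# Apply the same training-set values to both sets
toy_train$age_scaled <- (toy_train$age - age_mean) / age_sd
toy_test$age_scaled  <- (toy_test$age - age_mean) / age_sd

# The training column is exactly standardized; the test column is only close
c(train_mean = mean(toy_train$age_scaled), test_mean = mean(toy_test$age_scaled))
```

The test-set mean is not exactly zero, and that is expected: the test set played no part in choosing the scaling values.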

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

Spending too much data in training prevents us from computing a good assessment of predictive performance.

Spending too much data in testing prevents us from computing a good estimate of model parameters.
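
To make the testing side of this trade-off concrete, here is a small base R simulation. It is a sketch under a stated assumption: prediction errors are drawn from a standard normal distribution, so the true RMSE is 1. The helper rmse_spread() is hypothetical and only measures how much an RMSE estimate varies for a given test-set size.

```r
set.seed(123)

# Assumption: prediction errors are N(0, 1), so the true RMSE is 1
rmse_spread <- function(n_test, reps = 2000) {
  # RMSE computed on many simulated test sets of size n_test
  rmse <- replicate(reps, sqrt(mean(rnorm(n_test)^2)))
  sd(rmse)
}

test_sizes <- c(20, 100, 500)
spread <- sapply(test_sizes, rmse_spread)

data.frame(n_test = test_sizes, sd_of_rmse = round(spread, 3))
```

Smaller test sets give noisier performance estimates. The same logic applies in the other direction: every row moved into the test set is a row not available for estimating model parameters.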

Your turn

When is a good time to split your data?

The testing data is precious 💎

Data splitting and spending

set.seed(123)
frog_split <- initial_split(tree_frogs)
frog_split
#> <Training/Testing/Total>
#> <429/143/572>
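
The 429/143 split above comes from the default proportion: initial_split() puts 3/4 of the rows in training unless prop is set. The sketch below checks this on a hypothetical data frame (toy) with the same number of rows as tree_frogs, so it runs without the stacks package.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as tree_frogs
toy <- data.frame(id = seq_len(572))

set.seed(123)
default_split <- initial_split(toy)               # prop defaults to 3/4

set.seed(123)
explicit_split <- initial_split(toy, prop = 0.75) # same proportion, spelled out

nrow(training(default_split))
nrow(testing(default_split))
nrow(training(explicit_split))
```

Both calls should give the same training-set size, since 572 * 0.75 is exactly 429.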

Accessing the data

frog_train <- training(frog_split)
frog_test <- testing(frog_split)

The training set

frog_train
#> # A tibble: 429 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.36 morning        36
#>  2 gentamicin full    5.37 morning        72
#>  3 gentamicin full    4.65 afternoon     141
#>  4 control    full    5.42 morning        27
#>  5 control    full    5.43 morning        27
#>  6 gentamicin full    5.38 morning        73
#>  7 gentamicin full    5.42 morning        68
#>  8 gentamicin full    4.75 afternoon     124
#>  9 control    full    5.00 night          62
#> 10 control    full    5.39 morning        25
#> # … with 419 more rows
#> # ℹ Use `print(n = ...)` to see more rows

The test set

frog_test
#> # A tibble: 143 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.40 morning        22
#>  2 control    low     4.18 night         360
#>  3 control    full    4.63 afternoon     224
#>  4 gentamicin full    4.75 afternoon     158
#>  5 control    mid     4.22 night          91
#>  6 gentamicin full    4.89 night         301
#>  7 control    full    5.38 morning         2
#>  8 control    full    4.80 afternoon      56
#>  9 control    full    5.36 morning        11
#> 10 control    full    5.40 morning        64
#> # … with 133 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

Data splitting and spending

set.seed(123)
frog_split <- initial_split(tree_frogs, prop = 0.8)
frog_train <- training(frog_split)
frog_test <- testing(frog_split)

nrow(frog_train)
#> [1] 457
nrow(frog_test)
#> [1] 115
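
As a follow-up to the exercise, this sketch shows what changing the seed does and does not affect. It uses a hypothetical data frame (toy) and a hypothetical helper (split_ids()) that returns the row ids assigned to training for a given seed.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as tree_frogs
toy <- data.frame(id = seq_len(572))

# Return the ids that land in the training set for a given seed
split_ids <- function(seed) {
  set.seed(seed)
  training(initial_split(toy, prop = 0.8))$id
}

ids_a       <- split_ids(123)
ids_b       <- split_ids(456)
ids_a_again <- split_ids(123)

length(ids_a) == length(ids_b)   # the set sizes do not depend on the seed
identical(ids_a, ids_a_again)    # the same seed reproduces the same split
mean(ids_a %in% ids_b)           # share of training rows the two splits share
```

The seed changes which rows end up in each set, not how many, and reusing a seed makes the split reproducible.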

What about a validation set?
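
A validation set is a third partition, held out from training and used to compare or tune models before the test set is touched. The sketch below is one way to create it, assuming rsample 1.2.0 or later, where initial_validation_split() and validation() are available. The data frame toy and the 60/20/20 proportions are illustrative choices, not taken from the slides.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as tree_frogs
toy <- data.frame(id = seq_len(572))

set.seed(123)
# 60% training, 20% validation, and the remaining rows for testing
three_way <- initial_validation_split(toy, prop = c(0.6, 0.2))

toy_train <- training(three_way)
toy_val   <- validation(three_way)
toy_test  <- testing(three_way)

c(train = nrow(toy_train), validation = nrow(toy_val), test = nrow(toy_test))
```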

Exploratory data analysis for ML 🧐

Your turn

Explore the frog_train data on your own!

What’s the distribution of the outcome, latency?
What’s the distribution of numeric variables like age?
How does latency differ across the categorical variables?

ggplot(frog_train, aes(latency)) +
  geom_histogram(bins = 20)

ggplot(frog_train, aes(latency, treatment, fill = treatment)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE)

frog_train %>%
  ggplot(aes(latency, reflex, fill = reflex)) +
  geom_boxplot(alpha = 0.3, show.legend = FALSE)

ggplot(frog_train, aes(age, latency, color = reflex)) +
  geom_point(alpha = .8, size = 2)

Split smarter

Stratified sampling splits the data within each quartile of the outcome, so the training and test sets have similar outcome distributions.

Stratification

Use strata = latency

set.seed(123)
frog_split <- initial_split(tree_frogs, prop = 0.8, strata = latency)
frog_split
#> <Training/Testing/Total>
#> <456/116/572>
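
To see what stratifying on a numeric outcome does, the sketch below compares a plain and a stratified split on simulated data. The data frame sim and its right-skewed column y are hypothetical stand-ins for latency, and bin_share() is an illustrative helper that reports what share of a data set falls in each quartile bin of the full data.

```r
library(rsample)

set.seed(123)
# Hypothetical right-skewed outcome standing in for latency
sim <- data.frame(y = rexp(572, rate = 1 / 90))

plain <- initial_split(sim, prop = 0.8)
strat <- initial_split(sim, prop = 0.8, strata = y)

# Quartile cut points computed on the full data
breaks <- quantile(sim$y, probs = seq(0, 1, by = 0.25))

# Share of rows in each quartile bin
bin_share <- function(d) {
  bins <- cut(d$y, breaks = breaks, include.lowest = TRUE)
  round(prop.table(table(bins)), 3)
}

bin_share(testing(plain))   # shares can drift away from 0.25
bin_share(testing(strat))   # shares stay close to 0.25 in every bin
```

With stratification, each quartile of the outcome is split separately, so no bin is over- or under-represented in the test set by chance.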

Stratification often helps, with very little downside
