2 - Your data budget

Machine learning with tidymodels

Data on tree frog hatching

  • Red-eyed tree frog embryos can hatch earlier than their normal ~7 days if they detect a potential predator threat!
  • Type ?stacks::tree_frogs to learn more about this dataset, including references.
  • We are using a slightly modified version from stacks.
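library(tidymodels)

data("tree_frogs", package = "stacks")
tree_frogs <- tree_frogs %>%
  # treat time of day as a factor and rescale age from seconds to days
  mutate(t_o_d = factor(t_o_d),
         age = age / 86400) %>%
  # keep only rows with a non-missing outcome
  filter(!is.na(latency)) %>%
  # drop columns that are not used in this section
  select(-c(clutch, hatched))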

Data on tree frog hatching

  • N = 572
  • A numeric outcome, latency
  • 4 other variables
    • treatment, reflex, and t_o_d are nominal predictors
    • age is a numeric predictor

Data on tree frog hatching

tree_frogs
#> # A tibble: 572 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.40 morning        22
#>  2 control    low     4.18 night         360
#>  3 control    full    4.65 afternoon     106
#>  4 control    mid     4.14 night         180
#>  5 control    full    4.6  afternoon      60
#>  6 gentamicin full    5.36 morning        39
#>  7 control    full    4.56 afternoon     214
#>  8 control    full    5.43 morning        50
#>  9 control    full    4.63 afternoon     224
#> 10 control    full    5.40 morning        63
#> # … with 562 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.
  • Spending too much data in testing prevents us from computing a good estimate of model parameters.
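
As a rough illustration of this trade-off, the sketch below counts how many rows land in each set for a few values of prop. The data frame toy and the three proportions are assumptions chosen for illustration; only the row count of 572 is taken from the frog data.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as the frog data
toy <- data.frame(id = seq_len(572))

# Proportions of data spent on training (illustrative values)
props <- c(0.5, 0.75, 0.9)

sizes <- do.call(rbind, lapply(props, function(p) {
  set.seed(123)
  split <- initial_split(toy, prop = p)
  data.frame(
    prop    = p,
    n_train = nrow(training(split)),
    n_test  = nrow(testing(split))
  )
}))

sizes
```

Every row given to training is a row the test set no longer has, so the two columns always add up to the full 572.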

Your turn

When is a good time to split your data?

The testing data is precious 💎

Data splitting and spending

set.seed(123)
frog_split <- initial_split(tree_frogs)
frog_split
#> <Training/Testing/Total>
#> <429/143/572>

Accessing the data

frog_train <- training(frog_split)
frog_test <- testing(frog_split)
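
One way to convince yourself that the two sets really are separate is to tag every row before splitting and then check for overlap. This is only a sketch: row_id, frogs_id, and the id_* objects are hypothetical names introduced here, not part of the original material.

```r
library(tidymodels)

data("tree_frogs", package = "stacks")

# row_id is a hypothetical helper column, added only to track rows
frogs_id <- tree_frogs %>%
  filter(!is.na(latency)) %>%
  mutate(row_id = row_number())

set.seed(123)
id_split <- initial_split(frogs_id)
id_train <- training(id_split)
id_test  <- testing(id_split)

# Number of rows that appear in both sets
n_overlap <- length(intersect(id_train$row_id, id_test$row_id))
n_overlap

# Do the two sets together account for every row?
nrow(id_train) + nrow(id_test) == nrow(frogs_id)
```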

The training set

frog_train
#> # A tibble: 429 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.36 morning        36
#>  2 gentamicin full    5.37 morning        72
#>  3 gentamicin full    4.65 afternoon     141
#>  4 control    full    5.42 morning        27
#>  5 control    full    5.43 morning        27
#>  6 gentamicin full    5.38 morning        73
#>  7 gentamicin full    5.42 morning        68
#>  8 gentamicin full    4.75 afternoon     124
#>  9 control    full    5.00 night          62
#> 10 control    full    5.39 morning        25
#> # … with 419 more rows
#> # ℹ Use `print(n = ...)` to see more rows

The test set

frog_test
#> # A tibble: 143 × 5
#>    treatment  reflex   age t_o_d     latency
#>    <chr>      <fct>  <dbl> <fct>       <dbl>
#>  1 control    full    5.40 morning        22
#>  2 control    low     4.18 night         360
#>  3 control    full    4.63 afternoon     224
#>  4 gentamicin full    4.75 afternoon     158
#>  5 control    mid     4.22 night          91
#>  6 gentamicin full    4.89 night         301
#>  7 control    full    5.38 morning         2
#>  8 control    full    4.80 afternoon      56
#>  9 control    full    5.36 morning        11
#> 10 control    full    5.40 morning        64
#> # … with 133 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

Data splitting and spending

set.seed(123)
frog_split <- initial_split(tree_frogs, prop = 0.8)
frog_train <- training(frog_split)
frog_test <- testing(frog_split)

nrow(frog_train)
#> [1] 457
nrow(frog_test)
#> [1] 115
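
To see what changing the seed does, the sketch below compares which rows end up in the training set under different seeds. It uses a hypothetical stand-in data frame, toy, and a hypothetical helper, split_ids(); the seed values are arbitrary.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as the frog data
toy <- data.frame(id = seq_len(572))

# Hypothetical helper: return the ids of the training rows for a given seed
split_ids <- function(seed) {
  set.seed(seed)
  training(initial_split(toy, prop = 0.8))$id
}

ids_a <- split_ids(123)
ids_b <- split_ids(123)
ids_c <- split_ids(456)

# Same seed: are the training rows identical?
identical(ids_a, ids_b)

# Different seed: are the set sizes still the same?
length(ids_a) == length(ids_c)

# Share of training rows that the two seeds have in common
mean(ids_a %in% ids_c)
```

The seed changes which rows are selected, not how many, so the split sizes stay fixed while the contents of each set vary.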

What about a validation set?
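
If a separate validation set is wanted, recent versions of rsample provide initial_validation_split() for a three-way split. The sketch below is one possible setup, not necessarily the approach taken in this material: the stand-in data frame toy and the 60/20/20 proportions are assumptions chosen for illustration.

```r
library(rsample)

# Hypothetical stand-in with the same number of rows as the frog data
toy <- data.frame(id = seq_len(572))

set.seed(123)
# prop gives the training and validation shares; the remainder is the test set
three_way <- initial_validation_split(toy, prop = c(0.6, 0.2))
three_way

toy_train <- training(three_way)
toy_val   <- validation(three_way)
toy_test  <- testing(three_way)

c(train = nrow(toy_train), validation = nrow(toy_val), test = nrow(toy_test))
```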

Exploratory data analysis for ML 🧐

Your turn

Explore the frog_train data on your own!

  • What’s the distribution of the outcome, latency?
  • What’s the distribution of numeric variables like age?
  • How does latency differ across the categorical variables?

ggplot(frog_train, aes(latency)) +
  geom_histogram(bins = 20)

ggplot(frog_train, aes(latency, treatment, fill = treatment)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE)

frog_train %>%
  ggplot(aes(latency, reflex, fill = reflex)) +
  geom_boxplot(alpha = 0.3, show.legend = FALSE)

ggplot(frog_train, aes(age, latency, color = reflex)) +
  geom_point(alpha = .8, size = 2)
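
Numeric summaries can complement the plots above. The sketch below rebuilds the training set so that it runs on its own, then summarises latency overall and by treatment; the object name latency_by_treatment and the choice of median and IQR are assumptions made for illustration.

```r
library(tidymodels)

data("tree_frogs", package = "stacks")
tree_frogs <- tree_frogs %>%
  mutate(t_o_d = factor(t_o_d),
         age = age / 86400) %>%
  filter(!is.na(latency)) %>%
  select(-c(clutch, hatched))

set.seed(123)
frog_train <- training(initial_split(tree_frogs, prop = 0.8))

# Overall distribution of the outcome
summary(frog_train$latency)

# Outcome summarised within each treatment group
latency_by_treatment <- frog_train %>%
  group_by(treatment) %>%
  summarise(
    n              = n(),
    median_latency = median(latency),
    iqr_latency    = IQR(latency)
  )

latency_by_treatment
```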

Split smarter

Stratified sampling splits the data within each quartile of the outcome

Stratification

Use strata = latency

set.seed(123)
frog_split <- initial_split(tree_frogs, prop = 0.8, strata = latency)
frog_split
#> <Training/Testing/Total>
#> <456/116/572>
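
A quick way to check what stratification buys you is to bin latency into quartiles and compare the share of each bin in the training and test sets. This is a sketch under assumptions: the helper share_by_quartile() and the object names are hypothetical, and the cut points are computed here by hand rather than taken from rsample's internals.

```r
library(tidymodels)

data("tree_frogs", package = "stacks")
frogs <- tree_frogs %>%
  filter(!is.na(latency))

set.seed(123)
strat_split <- initial_split(frogs, prop = 0.8, strata = latency)

# Quartile cut points of the outcome, computed on the full data
cuts <- unique(quantile(frogs$latency, probs = seq(0, 1, by = 0.25)))

# Hypothetical helper: share of rows falling in each latency bin
share_by_quartile <- function(dat) {
  bins <- cut(dat$latency, breaks = cuts, include.lowest = TRUE)
  prop.table(table(bins))
}

train_share <- share_by_quartile(training(strat_split))
test_share  <- share_by_quartile(testing(strat_split))

round(rbind(train = train_share, test = test_share), 3)
```

With stratification, the two rows of the table should look similar; rerunning the split without strata = latency gives a point of comparison.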

Stratification often helps, with very little downside