2 - Your data budget

Introduction to tidymodels

Data on forests in Georgia

  • The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
  • This classification is important for all sorts of research, legislation, and land management purposes.
  • Plots are typically remeasured every 10 years and this dataset contains the most recent measurement per plot.
  • Type ?forested_ga to learn more about this dataset, including references.

Data on forests in Georgia

  • N = 10,937 plots of land, one from each of 10,937 6,000-acre hexagons in Georgia.
  • A nominal outcome, forested, with levels "Yes" and "No", measured “on-the-ground.”
  • 18 remotely-sensed and easily-accessible predictors:
    • numeric variables based on weather and topography.
    • nominal variables based on classifications from other governmental orgs.

Checklist for predictors

  • Is it ethical to use this variable? (Or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability?

Data on forests in Georgia

library(tidymodels)
library(forested)

forested_ga
#> # A tibble: 10,937 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       2007        14        0         0 No tree          13.9
#>  2 Yes       2007        66      -53        10 Tree             13.8
#>  3 Yes       2006        59      -82         6 No tree          13.5
#>  4 Yes       2007       116      -78        20 Tree             12.3
#>  5 Yes       2006       283       63        13 Tree             10.0
#>  6 Yes       2007       250       63        14 Tree             10.8
#>  7 Yes       2007        58       31         1 No tree          13.8
#>  8 Yes       2023       140       56        11 Tree             12.2
#>  9 Yes       2024       118       72        17 Tree             12.2
#> 10 Yes       2024       217      -46        13 Tree             12.4
#> # ℹ 10,927 more rows
#> # ℹ 12 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>, county <fct>
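
With 19 columns, the tibble print truncates; since library(tidymodels) attaches dplyr, glimpse() shows every column and its type at once (output omitted here):

# One line per column: name, type, and first few values
glimpse(forested_ga)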

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance (the test set ends up too small).
  • Spending too much data in testing prevents us from computing a good estimate of model parameters (the training set ends up too small).

Your turn

When is a good time to split your data?

03:00

The testing data is precious 💎

The initial split

set.seed(123)
forested_split <- initial_split(forested_ga)
forested_split
#> <Training/Testing/Total>
#> <8202/2735/10937>

By default, initial_split() puts three quarters of the data in training: 8,202 of the 10,937 rows here, leaving 2,735 for testing.

What is set.seed()?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
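
A minimal illustration of this (runif() is just an arbitrary random-number function for the demo):

set.seed(123)
runif(2)

set.seed(123)
runif(2)  # identical draws both times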

Accessing the data

forested_train <- training(forested_split)
forested_test <- testing(forested_split)
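
As a sanity check, the two sets partition the original data (8,202 + 2,735 = 10,937):

nrow(forested_train) + nrow(forested_test) == nrow(forested_ga)
#> [1] TRUE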

The training set

forested_train
#> # A tibble: 8,202 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       1997        66       82        10 Tree            12.2 
#>  2 No        1997       284      -99        58 Tree            10.3 
#>  3 Yes       2022       130       86        15 Tree            11.8 
#>  4 Yes       2021       202      -55         3 Tree            10.7 
#>  5 Yes       1995        75      -89         1 Tree            13.8 
#>  6 No        1995       110      -53         5 Tree            12.4 
#>  7 Yes       2022       111       73        12 Tree            11.5 
#>  8 Yes       1997       230       96        14 Tree             9.98
#>  9 Yes       2002       160      -88        13 Tree            11.1 
#> 10 Yes       2020        39        9         6 Tree            13.9 
#> # ℹ 8,192 more rows
#> # ℹ 12 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>, county <fct>

The test set

🙈

There are 2,735 rows and 19 columns in the test set.
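
If you only want to confirm its size without peeking at any rows, dim() is safe:

dim(forested_test)
#> [1] 2735   19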

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

05:00

Data splitting and spending

set.seed(123)
forested_split <- initial_split(forested_ga, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

nrow(forested_train)
#> [1] 8749
nrow(forested_test)
#> [1] 2188
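
One refinement worth knowing about, though not used above: for a classification outcome like forested, initial_split() can stratify by the outcome so that both sets keep roughly the same proportion of "Yes" and "No". A sketch:

set.seed(123)
forested_split <- initial_split(forested_ga, prop = 0.8, strata = forested)

Stratification is cheap insurance against an unlucky split, especially when one class is rare.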

Exploratory data analysis for ML 🧐

Your turn

Explore the forested_train data on your own!

  • What’s the distribution of the outcome, forested?
  • What’s the distribution of numeric variables like precip_annual?
  • How does the distribution of forested differ across the categorical variables?

08:00

# Class balance of the outcome
forested_train |> 
  ggplot(aes(x = forested)) +
  geom_bar()
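
A numeric companion to the bar chart, using dplyr's count() (the exact counts depend on your split):

forested_train |> 
  count(forested)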

# How the remotely-sensed tree_no_tree classification lines up with the
# on-the-ground outcome
forested_train |> 
  ggplot(aes(x = forested, fill = tree_no_tree)) +
  geom_bar()

# Overlaid histograms of annual precipitation, by outcome
forested_train |> 
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "identity", alpha = .7)

# Same histogram, rescaled: proportion forested within each precipitation bin
forested_train |> 
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "fill")

# A rough map: plot locations colored by outcome
forested_train |> 
  ggplot(aes(x = lon, y = lat, col = forested)) +
  geom_point()

The whole game - status update