2 - Your data budget

Introduction to tidymodels

Data on forests in Georgia

The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
This classification is important for all sorts of research, legislation, and land management purposes.
Plots are typically remeasured every 10 years, and this dataset contains the most recent measurement per plot.
Type ?forested_ga to learn more about this dataset, including references.

Data on forests in Georgia

N = 10,937 plots of land, one from each of the 10,937 hexagons (6,000 acres each) in Georgia.
A nominal outcome, forested, with levels "Yes" and "No", measured “on-the-ground.”
18 remotely sensed and easily accessible predictors:
- numeric variables based on weather and topography.
- nominal variables based on classifications from other governmental organizations.

Checklist for predictors

Is it ethical to use this variable? (Or even legal?)
Will this variable be available at prediction time?
Does this variable contribute to explainability?

Data on forests in Georgia

library(tidymodels)
library(forested)

forested_ga
#> # A tibble: 10,937 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       2007        14        0         0 No tree          13.9
#>  2 Yes       2007        66      -53        10 Tree             13.8
#>  3 Yes       2006        59      -82         6 No tree          13.5
#>  4 Yes       2007       116      -78        20 Tree             12.3
#>  5 Yes       2006       283       63        13 Tree             10.0
#>  6 Yes       2007       250       63        14 Tree             10.8
#>  7 Yes       2007        58       31         1 No tree          13.8
#>  8 Yes       2023       140       56        11 Tree             12.2
#>  9 Yes       2024       118       72        17 Tree             12.2
#> 10 Yes       2024       217      -46        13 Tree             12.4
#> # ℹ 10,927 more rows
#> # ℹ 12 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>, county <fct>

Data splitting and spending

For machine learning, we typically split data into training and test sets:

The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

Spending too much data in training prevents us from computing a good assessment of predictive performance.

Spending too much data in testing prevents us from computing a good estimate of model parameters.
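
One way to make the first half of this trade-off concrete is the standard error of an accuracy estimate, which shrinks with the square root of the test set size. The sketch below assumes a hypothetical true accuracy of 0.9 and a simple binomial approximation; neither the value nor the formula comes from these slides.

```r
# Hypothetical true accuracy, chosen only for illustration
accuracy <- 0.9

# Candidate test set sizes (2735 matches the initial split of forested_ga)
n_test <- c(100, 1000, 2735)

# Binomial standard error of the estimated accuracy
std_error <- sqrt(accuracy * (1 - accuracy) / n_test)

data.frame(n_test, std_error)
```

A smaller test set gives a noisier performance estimate, while every row moved into the test set is a row the model cannot learn from.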

Your turn

When is a good time to split your data?


The testing data is precious 💎

The initial split

set.seed(123)
forested_split <- initial_split(forested_ga)
forested_split
#> <Training/Testing/Total>
#> <8202/2735/10937>
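
These sizes follow from the default `prop = 3/4` in `initial_split()`. As a rough check, the arithmetic below assumes the training count is rounded down, which is consistent with the printed output.

```r
n_total <- 10937
prop <- 3 / 4  # default proportion of data assigned to training

# Assumption: the training count is rounded down
n_train <- floor(n_total * prop)
n_test <- n_total - n_train

n_train
#> [1] 8202
n_test
#> [1] 2735
```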

What is `set.seed()`?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
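
A small base R illustration of that determinism, using `sample()` on row indices rather than `initial_split()` itself (the variable names are made up for this sketch):

```r
# Draw five row indices after setting the seed
set.seed(123)
first_draw <- sample(1:10937, size = 5)

# Reset the same seed and draw again
set.seed(123)
second_draw <- sample(1:10937, size = 5)

# The same seed reproduces the same "random" indices
identical(first_draw, second_draw)
#> [1] TRUE
```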

Accessing the data

forested_train <- training(forested_split)
forested_test <- testing(forested_split)

The training set

forested_train
#> # A tibble: 8,202 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       1997        66       82        10 Tree            12.2 
#>  2 No        1997       284      -99        58 Tree            10.3 
#>  3 Yes       2022       130       86        15 Tree            11.8 
#>  4 Yes       2021       202      -55         3 Tree            10.7 
#>  5 Yes       1995        75      -89         1 Tree            13.8 
#>  6 No        1995       110      -53         5 Tree            12.4 
#>  7 Yes       2022       111       73        12 Tree            11.5 
#>  8 Yes       1997       230       96        14 Tree             9.98
#>  9 Yes       2002       160      -88        13 Tree            11.1 
#> 10 Yes       2020        39        9         6 Tree            13.9 
#> # ℹ 8,192 more rows
#> # ℹ 12 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>, county <fct>

The test set

🙈

There are 2735 rows and 19 columns in the test set.
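
To confirm those dimensions without printing any test set rows, `dim()` is enough. A minimal sketch that repeats the split from the earlier slides so it runs on its own:

```r
library(tidymodels)
library(forested)

set.seed(123)
forested_split <- initial_split(forested_ga)
forested_test <- testing(forested_split)

# Rows and columns only; no test set values are displayed
dim(forested_test)
#> [1] 2735   19
```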

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.


Data splitting and spending

set.seed(123)
forested_split <- initial_split(forested_ga, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

nrow(forested_train)
#> [1] 8749
nrow(forested_test)
#> [1] 2188

Exploratory data analysis for ML 🧐

Your turn

Explore the forested_train data on your own!

What’s the distribution of the outcome, forested?
What’s the distribution of numeric variables like precip_annual?
How does the distribution of forested differ across the categorical variables?


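# Distribution of the outcome
forested_train |>
  ggplot(aes(x = forested)) +
  geom_bar()

# Outcome broken down by the tree_no_tree classification
forested_train |>
  ggplot(aes(x = forested, fill = tree_no_tree)) +
  geom_bar()

# Overlaid histograms of annual precipitation by outcome
forested_train |>
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "identity", alpha = .7)

# Same histogram, shown as the share of each outcome within each bin
forested_train |>
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "fill")

# Outcome by location (longitude and latitude)
forested_train |>
  ggplot(aes(x = lon, y = lat, col = forested)) +
  geom_point()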
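
A numeric companion to the first bar chart, as one possible sketch: `count()` tabulates the outcome and `mutate()` adds proportions. It repeats the 80/20 split so it runs on its own; `outcome_counts` is a name chosen here for illustration.

```r
library(tidymodels)
library(forested)

set.seed(123)
forested_split <- initial_split(forested_ga, prop = 0.8)
forested_train <- training(forested_split)

# Counts and proportions of each outcome level in the training set
outcome_counts <- forested_train |>
  count(forested) |>
  mutate(prop = n / sum(n))

outcome_counts
```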
