2 - Your data budget

Introduction to tidymodels

Data on forests in Washington

  • The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
  • This classification is important for all sorts of research, legislation, and land management purposes.
  • Plots are typically remeasured every 10 years, and this dataset contains the most recent measurement per plot.
  • Type ?forested to learn more about this dataset, including references.

Data on forests in Washington

  • N = 7,107 plots of land, one from each of 7,107 6,000-acre hexagons in Washington.
  • A nominal outcome, forested, with levels "Yes" and "No", measured “on-the-ground.”
  • 18 remotely-sensed and easily-accessible predictors:
    • numeric variables based on weather and topography.
    • nominal variables based on classifications from other governmental organizations.

Checklist for predictors

  • Is it ethical to use this variable? (Or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability?

Data on forests in Washington

# tidymodels attaches rsample, dplyr, ggplot2, and friends
library(tidymodels)
# the forested data ships in the forested package
library(forested)

forested
#> # A tibble: 7,107 × 19
#>    forested  year elevation eastness northness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       2005       881       90        43        63 Tree             0.04
#>  2 Yes       2005       113      -25        96        30 Tree             6.4 
#>  3 No        2005       164      -84        53        13 Tree             6.06
#>  4 Yes       2005       299       93        34         6 No tree          4.43
#>  5 Yes       2005       806       47       -88        35 Tree             1.06
#>  6 Yes       2005       736      -27       -96        53 Tree             1.35
#>  7 Yes       2005       636      -48        87         3 No tree          1.42
#>  8 Yes       2005       224      -65       -75         9 Tree             6.39
#>  9 Yes       2005        52      -62        78        42 Tree             6.5 
#> 10 Yes       2005      2240      -67       -74        99 No tree         -5.63
#> # ℹ 7,097 more rows
#> # ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to obtain an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.
  • Spending too much data in testing prevents us from computing a good estimate of model parameters (see the sketch below).
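
To make the tradeoff concrete, here is a small base-R sketch (assuming training sizes are floored products, as rsample computes them) of how different proportions spend the 7,107 rows:

n <- 7107
props <- c(0.5, 0.75, 0.9)
data.frame(prop = props,
           train = floor(props * n),
           test  = n - floor(props * n))
#>   prop train test
#> 1 0.50  3553 3554
#> 2 0.75  5330 1777
#> 3 0.90  6396  711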

Your turn

When is a good time to split your data?

03:00

The testing data is precious 💎

The initial split

set.seed(123)
forested_split <- initial_split(forested)
forested_split
#> <Training/Testing/Total>
#> <5330/1777/7107>
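
By default, initial_split() sends three-quarters of the rows to training, which matches the sizes printed above. A quick sketch (assuming the floored product that rsample uses):

# 3/4 of the rows go to training by default
floor(3/4 * nrow(forested))
#> [1] 5330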

What is set.seed()?

To create that split of the data, R generates “pseudo-random” numbers: they behave like random numbers, but their generation is fully deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
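
A minimal sketch of that determinism: setting the same seed before a random computation reproduces its result exactly.

set.seed(123)
x <- runif(3)
set.seed(123)
y <- runif(3)
identical(x, y)
#> [1] TRUE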

Accessing the data

forested_train <- training(forested_split)
forested_test <- testing(forested_split)
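
As a quick sanity check, the two sets partition the data: every row lands in exactly one of them.

# training and testing rows together account for the whole dataset
nrow(forested_train) + nrow(forested_test) == nrow(forested)
#> [1] TRUE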

The training set

forested_train
#> # A tibble: 5,330 × 19
#>    forested  year elevation eastness northness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <fct>           <dbl>
#>  1 No        2016       464       -5       -99         7 No tree          1.71
#>  2 Yes       2016       166       92        37         7 Tree             6   
#>  3 No        2016       644      -85       -52        24 No tree          0.67
#>  4 Yes       2014      1285        4        99        79 Tree             1.91
#>  5 Yes       2013       822       87        48        68 Tree             1.95
#>  6 Yes       2017         3        6       -99         5 Tree             7.93
#>  7 Yes       2014      2041      -95        28        49 Tree            -4.22
#>  8 Yes       2015      1009       -8        99        72 Tree             1.72
#>  9 No        2017       436      -98        19        10 No tree          1.8 
#> 10 No        2018       775       63        76       103 No tree          0.62
#> # ℹ 5,320 more rows
#> # ℹ 11 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>

The test set

🙈

There are 1777 rows and 19 columns in the test set.
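
If you need to confirm its shape without otherwise peeking:

dim(forested_test)
#> [1] 1777   19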

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

05:00

Data splitting and spending

set.seed(123)
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test <- testing(forested_split)

nrow(forested_train)
#> [1] 5685
nrow(forested_test)
#> [1] 1422
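
Trying another seed (the name forested_split_2 below is just for illustration) draws a different random split, but the sizes stay the same:

set.seed(456)
forested_split_2 <- initial_split(forested, prop = 0.8)
forested_split_2
#> <Training/Testing/Total>
#> <5685/1422/7107>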

Exploratory data analysis for ML 🧐

Your turn

Explore the forested_train data on your own!

  • What’s the distribution of the outcome, forested?
  • What’s the distribution of numeric variables like precip_annual?
  • How does the distribution of forested differ across the categorical variables?

08:00

# distribution of the outcome
forested_train %>% 
  ggplot(aes(x = forested)) +
  geom_bar()

# outcome counts, split by the remotely-sensed tree/no-tree classification
forested_train %>% 
  ggplot(aes(x = forested, fill = tree_no_tree)) +
  geom_bar()

# overlaid histograms of annual precipitation for each outcome
forested_train %>% 
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "identity", alpha = .7)

# proportion of each outcome within each precipitation bin
forested_train %>% 
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "fill")

# where the plots sit on a map, colored by outcome
forested_train %>% 
  ggplot(aes(x = lon, y = lat, col = forested)) +
  geom_point()
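
For a numeric complement to the first bar chart above, dplyr’s count() (attached with tidymodels) tabulates the outcome directly:

forested_train %>% 
  count(forested)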

The whole game - status update