To create that split of the data, R generates “pseudo-random” numbers: they behave like random numbers, but their generation is fully deterministic given a “seed”.
This allows us to reproduce results by setting that seed.
Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
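As a minimal sketch of that determinism, using rsample’s `initial_split()` on the `forested` data (the seed value and object names here are arbitrary):

```r
library(rsample)

# Same seed before each call -> identical splits
set.seed(123)
split_a <- initial_split(forested)

set.seed(123)
split_b <- initial_split(forested)

identical(training(split_a), training(split_b))
#> [1] TRUE
```

Without the `set.seed()` calls, the two splits would almost certainly assign different rows to training and testing.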
```r
forested_train
#> # A tibble: 8,202 × 19
#>    forested  year elevation eastness roughness tree_no_tree dew_temp
#>    <fct>    <dbl>     <dbl>    <dbl>     <dbl> <fct>           <dbl>
#>  1 Yes       1997        66       82        10 Tree            12.2
#>  2 No        1997       284      -99        58 Tree            10.3
#>  3 Yes       2022       130       86        15 Tree            11.8
#>  4 Yes       2021       202      -55         3 Tree            10.7
#>  5 Yes       1995        75      -89         1 Tree            13.8
#>  6 No        1995       110      -53         5 Tree            12.4
#>  7 Yes       2022       111       73        12 Tree            11.5
#>  8 Yes       1997       230       96        14 Tree             9.98
#>  9 Yes       2002       160      -88        13 Tree            11.1
#> 10 Yes       2020        39        9         6 Tree            13.9
#> # ℹ 8,192 more rows
#> # ℹ 12 more variables: precip_annual <dbl>, temp_annual_mean <dbl>,
#> #   temp_annual_min <dbl>, temp_annual_max <dbl>, temp_january_min <dbl>,
#> #   vapor_min <dbl>, vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>,
#> #   land_type <fct>, county <fct>
```
The training set
```r
forested_train |>
  select(where(is.factor))
#> # A tibble: 8,202 × 4
#>    forested tree_no_tree land_type           county
#>    <fct>    <fct>        <fct>               <fct>
#>  1 Yes      Tree         Tree                Muscogee
#>  2 No       Tree         Tree                Polk
#>  3 Yes      Tree         Tree                Hancock
#>  4 Yes      Tree         Tree                Oglethorpe
#>  5 Yes      Tree         Tree                Berrien
#>  6 No       Tree         Non-tree vegetation Dooly
#>  7 Yes      Tree         Tree                Columbia
#>  8 Yes      Tree         Tree                Walker
#>  9 Yes      Tree         Tree                Greene
#> 10 Yes      Tree         Tree                Wayne
#> # ℹ 8,192 more rows
```
The test set
🙈
There are 2735 rows and 19 columns in the test set.
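You can confirm the dimensions without inspecting the rows themselves. A sketch, assuming the split object is named `forested_split` (a placeholder name):

```r
forested_test <- testing(forested_split)

# Check size only -- resist looking at the rows!
dim(forested_test)
#> [1] 2735   19
```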
Your turn
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
We recommend using the .qmd files in the classwork/ folder for code exercises. They set you up with the code from the slides.
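One possible starting point for the exercise, using `initial_split()`’s `prop` argument to hold out 20% (the seed and object names are placeholders to change):

```r
set.seed(123)  # swap in different seeds and re-run

forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)
```

Note that `prop` is the proportion kept for *training*, so `prop = 0.8` leaves 20% for the test set.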