3 - What makes a model?

Introduction to tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

  • lm for linear model

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets

  • brulee for regression using torch
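With tidymodels, every one of these is reachable through the same specification interface. A minimal sketch (each engine's package is only needed once you actually fit the model):

library(parsnip)

linear_reg()                           # defaults to the lm engine
linear_reg() |> set_engine("glmnet")   # regularized regression
linear_reg() |> set_engine("stan")     # Bayesian regression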

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




To specify a model

logistic_reg()
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

To specify a model

logistic_reg() |>
  set_engine("glmnet")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glmnet

To specify a model

logistic_reg() |>
  set_engine("stan")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: stan

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode




To specify a model

decision_tree()
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart

To specify a model

decision_tree() |> 
  set_mode("classification")
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart



All available models are listed at https://www.tidymodels.org/find/parsnip/
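You can also list them from R itself with parsnip's show_engines():

# One row per available engine/mode combination for a model type
show_engines("decision_tree")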

Your turn

Run the tree_spec chunk in your .qmd.

Edit this code to use a logistic regression model.

All available models are listed at https://www.tidymodels.org/find/parsnip/



Extension/Challenge: Edit this code to use a different model. For example, try a conditional inference tree as implemented in the partykit package by changing the engine, or try an entirely different model type!
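One possible solution sketch (the partykit engine for decision_tree() is registered by the bonsai extension package, which this assumes is installed):

logistic_reg()   # classification is the default (and only) mode

library(bonsai)
decision_tree() |>
  set_engine("partykit") |>   # conditional inference tree
  set_mode("classification")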


Models we’ll be using today

  • Logistic regression
  • Decision trees

Logistic regression

  • Logit of the outcome probability modeled as a linear combination of predictors (see the numeric sketch after this list):

\(\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{A}\)

  • Find a sigmoid curve that separates the two classes
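For instance, a minimal numeric sketch of inverting that logit, with hypothetical values \(\beta_0 = -1\), \(\beta_1 = 0.5\), and \(A = 2\):

beta_0 <- -1
beta_1 <- 0.5
A <- 2

# Invert the logit to recover the probability (base R's plogis() does the same)
p <- 1 / (1 + exp(-(beta_0 + beta_1 * A)))
p
#> [1] 0.5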

Decision trees

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some stopping condition is met (maximum depth, no more data to split)

  • Then the tree is pruned to reduce its complexity (see the sketch after this list)
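In parsnip, these growing and pruning conditions are exposed as arguments of decision_tree(). A minimal sketch, assuming the default rpart engine (which maps tree_depth, min_n, and cost_complexity to its maxdepth, minsplit, and cp controls):

# Cap growth at depth 4, require 20 rows to split a node, prune with cp = 0.01
decision_tree(tree_depth = 4, min_n = 20, cost_complexity = 0.01) |>
  set_mode("classification")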

All models are wrong, but some are useful!

[Figure: fitted logistic regression and decision tree models, side by side]

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a workflow()?

  • Workflows handle new data better than base R tools, for example when new factor levels appear at prediction time
  • You can use other preprocessors besides formulas (more on feature engineering in Advanced tidymodels!)
  • They can help organize your work when working with multiple models
  • Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit

A model workflow

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_spec |> 
  fit(forested ~ ., data = forested_train) 
#> parsnip model object
#> 
#> n= 8749 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#> 1) root 8749 2427 Yes (0.7225969 0.2774031)  
#>   2) county=Appling,Atkinson,Bacon,Baldwin,Ben Hill,Brantley,Brooks,Bryan,Bulloch,Burke,Butts,Camden,Candler,Carroll,Charlton,Chattahoochee,Chattooga,Cherokee,Clinch,Coffee,Coweta,Crawford,Dade,Dawson,Dodge,Dougherty,Douglas,Echols,Effingham,Elbert,Emanuel,Evans,Fannin,Floyd,Gilmer,Glascock,Greene,Habersham,Hancock,Haralson,Harris,Heard,Jasper,Jeff Davis,Jefferson,Jenkins,Johnson,Jones,Lamar,Lanier,Laurens,Lee,Lincoln,Long,Lumpkin,Marion,McDuffie,Meriwether,Monroe,Montgomery,Morgan,Murray,Oconee,Oglethorpe,Paulding,Pickens,Pierce,Pike,Polk,Putnam,Quitman,Rabun,Randolph,Schley,Screven,Spalding,Stephens,Stewart,Talbot,Taliaferro,Tattnall,Taylor,Telfair,Terrell,Towns,Treutlen,Troup,Twiggs,Union,Upson,Walker,Ware,Warren,Washington,Wayne,Webster,Wheeler,White,Wilcox,Wilkes,Wilkinson 5598 1005 Yes (0.8204716 0.1795284) *
#>   3) county=Baker,Banks,Barrow,Bartow,Berrien,Bibb,Bleckley,Calhoun,Catoosa,Chatham,Clarke,Clay,Clayton,Cobb,Colquitt,Columbia,Cook,Crisp,Decatur,DeKalb,Dooly,Early,Fayette,Forsyth,Franklin,Fulton,Glynn,Gordon,Grady,Gwinnett,Hall,Hart,Henry,Houston,Irwin,Jackson,Liberty,Lowndes,Macon,Madison,McIntosh,Miller,Mitchell,Muscogee,Newton,Peach,Pulaski,Richmond,Rockdale,Seminole,Sumter,Thomas,Tift,Toombs,Turner,Walton,Whitfield,Worth 3151 1422 Yes (0.5487147 0.4512853)  
#>     6) canopy_cover>=41.5 1773  603 Yes (0.6598985 0.3401015) *
#>     7) canopy_cover< 41.5 1378  559 No (0.4056604 0.5943396) *

A model workflow

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

workflow() |>
  add_formula(forested ~ .) |>
  add_model(tree_spec) |>
  fit(data = forested_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> forested ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 8749 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#> 1) root 8749 2427 Yes (0.7225969 0.2774031)  
#>   2) county=Appling,Atkinson,Bacon,Baldwin,Ben Hill,Brantley,Brooks,Bryan,Bulloch,Burke,Butts,Camden,Candler,Carroll,Charlton,Chattahoochee,Chattooga,Cherokee,Clinch,Coffee,Coweta,Crawford,Dade,Dawson,Dodge,Dougherty,Douglas,Echols,Effingham,Elbert,Emanuel,Evans,Fannin,Floyd,Gilmer,Glascock,Greene,Habersham,Hancock,Haralson,Harris,Heard,Jasper,Jeff Davis,Jefferson,Jenkins,Johnson,Jones,Lamar,Lanier,Laurens,Lee,Lincoln,Long,Lumpkin,Marion,McDuffie,Meriwether,Monroe,Montgomery,Morgan,Murray,Oconee,Oglethorpe,Paulding,Pickens,Pierce,Pike,Polk,Putnam,Quitman,Rabun,Randolph,Schley,Screven,Spalding,Stephens,Stewart,Talbot,Taliaferro,Tattnall,Taylor,Telfair,Terrell,Towns,Treutlen,Troup,Twiggs,Union,Upson,Walker,Ware,Warren,Washington,Wayne,Webster,Wheeler,White,Wilcox,Wilkes,Wilkinson 5598 1005 Yes (0.8204716 0.1795284) *
#>   3) county=Baker,Banks,Barrow,Bartow,Berrien,Bibb,Bleckley,Calhoun,Catoosa,Chatham,Clarke,Clay,Clayton,Cobb,Colquitt,Columbia,Cook,Crisp,Decatur,DeKalb,Dooly,Early,Fayette,Forsyth,Franklin,Fulton,Glynn,Gordon,Grady,Gwinnett,Hall,Hart,Henry,Houston,Irwin,Jackson,Liberty,Lowndes,Macon,Madison,McIntosh,Miller,Mitchell,Muscogee,Newton,Peach,Pulaski,Richmond,Rockdale,Seminole,Sumter,Thomas,Tift,Toombs,Turner,Walton,Whitfield,Worth 3151 1422 Yes (0.5487147 0.4512853)  
#>     6) canopy_cover>=41.5 1773  603 Yes (0.6598985 0.3401015) *
#>     7) canopy_cover< 41.5 1378  559 No (0.4056604 0.5943396) *

A model workflow

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> forested ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 8749 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#> 1) root 8749 2427 Yes (0.7225969 0.2774031)  
#>   2) county=Appling,Atkinson,Bacon,Baldwin,Ben Hill,Brantley,Brooks,Bryan,Bulloch,Burke,Butts,Camden,Candler,Carroll,Charlton,Chattahoochee,Chattooga,Cherokee,Clinch,Coffee,Coweta,Crawford,Dade,Dawson,Dodge,Dougherty,Douglas,Echols,Effingham,Elbert,Emanuel,Evans,Fannin,Floyd,Gilmer,Glascock,Greene,Habersham,Hancock,Haralson,Harris,Heard,Jasper,Jeff Davis,Jefferson,Jenkins,Johnson,Jones,Lamar,Lanier,Laurens,Lee,Lincoln,Long,Lumpkin,Marion,McDuffie,Meriwether,Monroe,Montgomery,Morgan,Murray,Oconee,Oglethorpe,Paulding,Pickens,Pierce,Pike,Polk,Putnam,Quitman,Rabun,Randolph,Schley,Screven,Spalding,Stephens,Stewart,Talbot,Taliaferro,Tattnall,Taylor,Telfair,Terrell,Towns,Treutlen,Troup,Twiggs,Union,Upson,Walker,Ware,Warren,Washington,Wayne,Webster,Wheeler,White,Wilcox,Wilkes,Wilkinson 5598 1005 Yes (0.8204716 0.1795284) *
#>   3) county=Baker,Banks,Barrow,Bartow,Berrien,Bibb,Bleckley,Calhoun,Catoosa,Chatham,Clarke,Clay,Clayton,Cobb,Colquitt,Columbia,Cook,Crisp,Decatur,DeKalb,Dooly,Early,Fayette,Forsyth,Franklin,Fulton,Glynn,Gordon,Grady,Gwinnett,Hall,Hart,Henry,Houston,Irwin,Jackson,Liberty,Lowndes,Macon,Madison,McIntosh,Miller,Mitchell,Muscogee,Newton,Peach,Pulaski,Richmond,Rockdale,Seminole,Sumter,Thomas,Tift,Toombs,Turner,Walton,Whitfield,Worth 3151 1422 Yes (0.5487147 0.4512853)  
#>     6) canopy_cover>=41.5 1773  603 Yes (0.6598985 0.3401015) *
#>     7) canopy_cover< 41.5 1378  559 No (0.4056604 0.5943396) *

Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code to make a workflow with your own model of choice.



Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?
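One possible answer, sketched under the assumption that the recipes package is loaded: besides formulas, workflows accept recipes and raw variable selections.

# A recipe as preprocessor (more on recipes in Advanced tidymodels)
workflow() |>
  add_recipe(recipe(forested ~ ., data = forested_train)) |>
  add_model(tree_spec)

# add_variables() selects outcomes and predictors with tidyselect syntax
workflow() |>
  add_variables(outcomes = forested, predictors = everything()) |>
  add_model(tree_spec)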


Predict with your model

How do you use your new tree_fit model?

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_fit <-
  workflow(forested ~ ., tree_spec) |> 
  fit(data = forested_train) 

Your turn

Run:

predict(tree_fit, new_data = forested_test)

What do you notice about the structure of the result?


Your turn

Run:

augment(tree_fit, new_data = forested_test)

How does the output compare to the output from predict()?


The tidymodels prediction guarantee!

  • The predictions will always be inside a tibble
  • The column names and types are unsurprising and predictable
  • The number of rows in new_data and the output are the same (see the sketch after this list)
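For tree_fit, a classification fit whose outcome has levels Yes and No, the guarantee looks like this (a sketch; the probability column names follow from the factor levels):

predict(tree_fit, new_data = forested_test)   # tibble with a .pred_class column
augment(tree_fit, new_data = forested_test)   # forested_test plus .pred_class,
                                              # .pred_Yes, and .pred_No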

Understand your model

How do you understand your new tree_fit model?


library(rpart.plot)
tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.
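For instance, a sketch of three extractors from the workflows package:

tree_fit |> extract_preprocessor()   # the formula preprocessor
tree_fit |> extract_fit_parsnip()    # the parsnip model object
tree_fit |> extract_fit_engine()     # the underlying rpart object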

⚠️ Never predict() with any extracted components!

Understand your model

How do you understand your new tree_fit model?

You can use your fitted workflow for model and/or prediction explanations:

  • overall variable importance, such as with the vip package (see the sketch after this list)
  • flexible model explainers, such as with the DALEXtra package
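For example, a minimal sketch with vip, assuming that package is installed and that rpart's built-in importance scores are what you want to plot:

library(vip)

tree_fit |>
  extract_fit_engine() |>   # the underlying rpart object
  vip()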

Your turn


Extract the model engine object from your fitted workflow and check it out.


The whole game - status update