3 - What makes a model?

Introduction to tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

  • lm for linear model

  • glm for generalized linear model (e.g. logistic regression)

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets
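As a minimal sketch of the first two options, base R's lm() and glm() share the same formula interface but differ in their defaults. The built-in mtcars data is used here only for illustration; it is not part of the workshop material.

```r
# Ordinary least squares with lm(): miles per gallon as a function of weight
lm_fit <- lm(mpg ~ wt, data = mtcars)

# Logistic regression with glm(): transmission type (0/1) as a function of weight
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial())

coef(lm_fit)
coef(glm_fit)

# Both use predict(), but the defaults differ:
# lm() predicts on the outcome scale, glm() on the link (log-odds) scale
new_car <- data.frame(wt = 3)
predict(lm_fit, new_car)
predict(glm_fit, new_car, type = "response")  # probability rather than log-odds
```

Small interface differences like this multiply across engines, which is the motivation for a unified specification.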

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode

logistic_reg()
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

logistic_reg() %>%
  set_engine("glmnet")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glmnet

logistic_reg() %>%
  set_engine("stan")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: stan

decision_tree()
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart

decision_tree() %>% 
  set_mode("classification")
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart



All available models are listed at https://www.tidymodels.org/find/parsnip/

Your turn

Run the tree_spec chunk in your .qmd.

Edit this code to use a logistic regression model.

All available models are listed at https://www.tidymodels.org/find/parsnip/



Extension/Challenge: Edit this code to use a different model. For example, try using a conditional inference tree as implemented in the partykit package by changing the engine - or try an entirely different model type!

Models we’ll be using today

  • Logistic regression
  • Decision trees

Logistic regression

  • Logit of outcome probability modeled as linear combination of predictors:

\(\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{A}\)

  • The fitted probabilities follow a sigmoid (S-shaped) curve; thresholding them, e.g. at 0.5, separates the two classes
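The relationship between the linear predictor and the sigmoid can be sketched in base R. The coefficient values below are hypothetical, chosen only so the numbers are easy to follow.

```r
# Hypothetical coefficients, for illustration only
beta_0 <- -2
beta_1 <- 0.5

a <- seq(0, 10, by = 2)          # values of the predictor A
log_odds <- beta_0 + beta_1 * a  # linear in the predictor
p <- plogis(log_odds)            # inverse logit: 1 / (1 + exp(-log_odds))

# Applying the logit to p recovers the linear predictor
all.equal(qlogis(p), log_odds)

# Thresholding the probability at 0.5 gives the class boundary,
# which sits where the log-odds cross zero: A = -beta_0 / beta_1
data.frame(a, log_odds, p, predicted_class = ifelse(p > 0.5, "yes", "no"))
```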

Decision trees

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some condition is met (maximum depth, no more data)

  • Then the tree is pruned to reduce its complexity
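The grow-then-prune idea can be sketched directly with the rpart package, the default engine shown for decision_tree() above. The iris data and the specific control values are assumptions made for illustration.

```r
library(rpart)

# Grow a deliberately large tree: cp = 0 disables the complexity penalty,
# so growth stops only at maxdepth or when nodes are too small to split
big_tree <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(cp = 0, minsplit = 2, maxdepth = 10)
)

# Prune back: remove splits that do not improve the fit by at least cp
small_tree <- prune(big_tree, cp = 0.1)

# Number of terminal nodes (leaves) before and after pruning
n_leaves_big <- sum(big_tree$frame$var == "<leaf>")
n_leaves_small <- sum(small_tree$frame$var == "<leaf>")
c(big = n_leaves_big, small = n_leaves_small)
```

In parsnip, the cost_complexity argument of decision_tree() plays the role of cp when the engine is rpart.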

All models are wrong, but some are useful!

[Figure panels: Logistic regression; Decision trees]

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a workflow()?

  • Workflows handle new factor levels in new data better than base R tools
  • You can use other preprocessors besides formulas (more on feature engineering in Advanced tidymodels!)
  • They can help organize your work when working with multiple models
  • Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
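To illustrate the last two points, here is a minimal sketch of a workflow that uses a recipe rather than a formula as its preprocessor. The mtcars data, the chosen predictors, and the object names are assumptions for illustration, not the workshop's taxi example.

```r
library(tidymodels)

# Illustrative data: mtcars with a factor outcome
cars <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

# A recipe as preprocessor instead of a formula
cars_rec <- recipe(am ~ wt + hp, data = cars) %>%
  step_normalize(all_numeric_predictors())

cars_wflow <- workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(logistic_reg())

# fit() estimates the normalization statistics and the model together
cars_fit <- fit(cars_wflow, data = cars)

# predict() applies the same normalization to new data before predicting
cars_pred <- predict(cars_fit, new_data = head(cars))
cars_pred
```

Because the normalization lives inside the workflow, there is no separate preprocessing step to remember at prediction time.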

A model workflow

tree_spec <-
  decision_tree(cost_complexity = 0.002) %>% 
  set_mode("classification")

tree_spec %>% 
  fit(tip ~ ., data = taxi_train) 
#> parsnip model object
#> 
#> n= 8000 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 8000 616 yes (0.92300000 0.07700000)  
#>    2) distance>=14.12 2041  68 yes (0.96668300 0.03331700) *
#>    3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)  
#>      6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
#>      7) distance>=5.275 540  98 yes (0.81851852 0.18148148)  
#>       14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478  68 yes (0.85774059 0.14225941) *
#>       15) company=Flash Cab 62  30 yes (0.51612903 0.48387097)  
#>         30) dow=Thu 12   2 yes (0.83333333 0.16666667) *
#>         31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50  22 no (0.44000000 0.56000000)  
#>           62) distance>=11.77 14   4 yes (0.71428571 0.28571429) *
#>           63) distance< 11.77 36  12 no (0.33333333 0.66666667) *

A model workflow

tree_spec <-
  decision_tree(cost_complexity = 0.002) %>% 
  set_mode("classification")

workflow() %>%
  add_formula(tip ~ .) %>%
  add_model(tree_spec) %>%
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 8000 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 8000 616 yes (0.92300000 0.07700000)  
#>    2) distance>=14.12 2041  68 yes (0.96668300 0.03331700) *
#>    3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)  
#>      6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
#>      7) distance>=5.275 540  98 yes (0.81851852 0.18148148)  
#>       14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478  68 yes (0.85774059 0.14225941) *
#>       15) company=Flash Cab 62  30 yes (0.51612903 0.48387097)  
#>         30) dow=Thu 12   2 yes (0.83333333 0.16666667) *
#>         31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50  22 no (0.44000000 0.56000000)  
#>           62) distance>=11.77 14   4 yes (0.71428571 0.28571429) *
#>           63) distance< 11.77 36  12 no (0.33333333 0.66666667) *

A model workflow

tree_spec <-
  decision_tree(cost_complexity = 0.002) %>% 
  set_mode("classification")

workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 8000 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 8000 616 yes (0.92300000 0.07700000)  
#>    2) distance>=14.12 2041  68 yes (0.96668300 0.03331700) *
#>    3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)  
#>      6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
#>      7) distance>=5.275 540  98 yes (0.81851852 0.18148148)  
#>       14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478  68 yes (0.85774059 0.14225941) *
#>       15) company=Flash Cab 62  30 yes (0.51612903 0.48387097)  
#>         30) dow=Thu 12   2 yes (0.83333333 0.16666667) *
#>         31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50  22 no (0.44000000 0.56000000)  
#>           62) distance>=11.77 14   4 yes (0.71428571 0.28571429) *
#>           63) distance< 11.77 36  12 no (0.33333333 0.66666667) *

Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code to make a workflow with your own model of choice.



Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?

Predict with your model

How do you use your new tree_fit model?

tree_spec <-
  decision_tree(cost_complexity = 0.002) %>% 
  set_mode("classification")

tree_fit <-
  workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 

Your turn

Run:

predict(tree_fit, new_data = taxi_test)

What do you get?

Your turn

Run:

augment(tree_fit, new_data = taxi_test)

What do you get?

The tidymodels prediction guarantee!

  • The predictions will always be inside a tibble
  • The column names and types are unsurprising and predictable
  • The number of rows in new_data and the output are the same
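These three properties can be checked on a small example. This sketch assumes the built-in mtcars data with a factor outcome in place of the taxi data; the object names are illustrative.

```r
library(tidymodels)

# Illustrative data: mtcars with a two-level factor outcome
cars <- mtcars %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))

cars_fit <- workflow(am ~ wt + hp, decision_tree(mode = "classification")) %>%
  fit(data = cars)

new_cars <- cars[1:5, ]

class_pred <- predict(cars_fit, new_data = new_cars)
prob_pred <- predict(cars_fit, new_data = new_cars, type = "prob")

# Predictions come back as tibbles
is_tibble(class_pred)

# Column names follow a fixed pattern
names(class_pred)  # ".pred_class"
names(prob_pred)   # ".pred_automatic" ".pred_manual"

# One output row per row of new_data
nrow(class_pred) == nrow(new_cars)
```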

Understand your model

How do you understand your new tree_fit model?

library(rpart.plot)
tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.

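⚠️ Never predict() with any extracted components! Extracted objects bypass the workflow's preprocessing, so always predict() with the fitted workflow itself.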
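A minimal sketch of the extractors on a fitted workflow follows, again assuming mtcars as stand-in data and illustrative object names.

```r
library(tidymodels)

# Illustrative data and fitted workflow
cars <- mtcars %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))

cars_fit <- workflow(am ~ wt + hp, decision_tree(mode = "classification")) %>%
  fit(data = cars)

extract_preprocessor(cars_fit)   # the formula used as preprocessor
extract_spec_parsnip(cars_fit)   # the model specification
extract_fit_parsnip(cars_fit)    # the parsnip model fit
extract_fit_engine(cars_fit)     # the underlying rpart object

# For predictions, go through the workflow rather than an extracted piece
predict(cars_fit, new_data = cars[1:3, ])
```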

You can use your fitted workflow for model and/or prediction explanations:

  • overall variable importance, such as with the vip package
  • flexible model explainers, such as with the DALEXtra package
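As a rough sketch of the first option, model-based variable importance can be computed from the engine fit of a tree workflow. This assumes the vip package is installed and uses iris as stand-in data; the object names are illustrative.

```r
library(tidymodels)

# Illustrative fitted workflow on iris
iris_fit <- workflow(Species ~ ., decision_tree(mode = "classification")) %>%
  fit(data = iris)

# Model-based importance scores from the underlying rpart fit
importance <- iris_fit %>%
  extract_fit_engine() %>%
  vip::vi()

importance
```

Passing the same extracted object to vip::vip() draws the scores as a bar chart.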

Your turn


Extract the model engine object from your fitted workflow and check it out.

The whole game - status update