3 - What makes a model?

Machine learning with tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

03:00
  • lm for linear model

  • glm for generalized linear model (e.g. logistic regression)

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode

To specify a model

logistic_reg()
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode

To specify a model

logistic_reg() %>%
  set_engine("glmnet")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glmnet

To specify a model

logistic_reg() %>%
  set_engine("stan")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: stan

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode

To specify a model

decision_tree()
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart

To specify a model

decision_tree() %>% 
  set_mode("classification")
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart



All available models are listed at https://www.tidymodels.org/find/parsnip/

To specify a model

  • Choose a model
  • Specify an engine
  • Set the mode

Your turn

Run the tree_spec chunk in your .qmd.

Edit this code to use a different model.

05:00



All available models are listed at https://www.tidymodels.org/find/parsnip/

Models we’ll be using today

  • Logistic regression
  • Decision trees

Logistic regression

Logistic regression

Logistic regression

  • Logit of outcome probability modeled as linear combination of predictors:

\(log(\frac{p}{1 - p}) = \beta_0 + \beta_1\cdot \text{distance}\)

  • Find a sigmoid line that separates the two classes

Decision trees

Decision trees

  • Series of splits or if/then statements based on predictors

  • First the tree grows until some condition is met (maximum depth, no more data)

  • Then the tree is pruned to reduce its complexity

Decision trees

All models are wrong, but some are useful!

Logistic regression

Decision trees

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a workflow()?

  • Workflows handle new data better than base R tools in terms of new factor levels
  • You can use other preprocessors besides formulas (more on feature engineering tomorrow!)
  • They can help organize your work when working with multiple models
  • Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

tree_spec %>% 
  fit(tip ~ ., data = taxi_train) 
#> parsnip model object
#> 
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

workflow() %>%
  add_formula(tip ~ .) %>%
  add_model(tree_spec) %>%
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code to make a workflow with your own model of choice.

05:00

Predict with your model

How do you use your new tree_fit model?

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

tree_fit <-
  workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 

Your turn

Run:

predict(tree_fit, new_data = taxi_test)

What do you get?

03:00

Your turn

Run:

augment(tree_fit, new_data = taxi_test)

What do you get?

03:00

The tidymodels prediction guarantee!

  • The predictions will always be inside a tibble
  • The column names and types are unsurprising and predictable
  • The number of rows in new_data and the output are the same

Understand your model

How do you understand your new tree_fit model?

Understand your model

How do you understand your new tree_fit model?

library(rpart.plot)
tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.

Understand your model

How do you understand your new tree_fit model?

You can use your fitted workflow for model and/or prediction explanations:

  • overall variable importance, such as with the vip package
  • flexible model explainers, such as with the DALEXtra package

Your turn

Extract the model engine object from your fitted workflow.

⚠️ Never predict() with any extracted components!

05:00

Deploy your model

Deploying a model

How do you use your new tree_fit model in production?

library(vetiver)
v <- vetiver_model(tree_fit, "taxi")
v
#> 
#> ── taxi ─ <bundled_workflow> model for deployment 
#> A rpart classification modeling workflow using 6 features

Learn more at https://vetiver.rstudio.com

Deploy your model

How do you use your new model tree_fit in production?

library(plumber)
pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver
#> ├──/ping (GET)
#> └──/predict (POST)

Learn more at https://vetiver.rstudio.com

Your turn

Run the vetiver chunk in your .qmd.

Check out the automated visual documentation.

05:00

The whole game - status update