3 - What makes a model?

Machine learning with tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

03:00

lm for linear model
glm for generalized linear model (e.g. logistic regression)
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets
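
As a minimal sketch of two of these routes side by side, the code below fits the same linear model with base R's lm() and with a parsnip specification that uses the "lm" engine. The mtcars data and the formula mpg ~ wt are illustrative assumptions, not part of the workshop data.

```r
library(tidymodels)

# Base R: the formula interface to lm()
base_fit <- lm(mpg ~ wt, data = mtcars)

# parsnip: declare the model first, then fit it
parsnip_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt, data = mtcars)

# Both routes call lm() underneath, so the coefficients should agree
coef(base_fit)
coef(extract_fit_engine(parsnip_fit))
```

The interface differs, but the underlying estimate is the same; swapping the engine string is what changes the computational backend.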

To specify a model

Choose a model
Specify an engine
Set the mode

To specify a model

logistic_reg()
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

To specify a model

logistic_reg() %>%
  set_engine("glmnet")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glmnet

To specify a model

logistic_reg() %>%
  set_engine("stan")
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: stan

To specify a model

decision_tree()
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart

To specify a model

decision_tree() %>% 
  set_mode("classification")
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart

All available models are listed at https://www.tidymodels.org/find/parsnip/
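
The three steps can be combined in a single pipeline. This is a small sketch; the object name lr_spec is made up for illustration, and "glm" is simply the default engine spelled out explicitly.

```r
library(tidymodels)

# 1. Choose a model, 2. specify an engine, 3. set the mode
lr_spec <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# The specification stores the choices; nothing has been fit yet
lr_spec$engine
lr_spec$mode

# List the engines parsnip knows about for this model type
show_engines("logistic_reg")
```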

Your turn

Run the tree_spec chunk in your .qmd.

Edit this code to use a different model.

05:00

All available models are listed at https://www.tidymodels.org/find/parsnip/

Models we’ll be using today

Logistic regression
Decision trees

Logistic regression

Logit of outcome probability modeled as linear combination of predictors:

\(\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{distance}\)

Find a sigmoid curve that separates the two classes
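
To make the logit relationship concrete, here is a sketch on simulated data using base R's glm(). The variable names, sample size, and true coefficients (-2 and 0.25) are assumptions chosen for illustration; they are not estimates from the taxi data.

```r
set.seed(1)

# Simulated data: `distance` and the coefficients below are hypothetical
n <- 500
distance <- runif(n, 0, 20)
p_true <- plogis(-2 + 0.25 * distance)  # inverse logit of the linear predictor
tip <- rbinom(n, size = 1, prob = p_true)

fit <- glm(tip ~ distance, family = binomial())
b <- coef(fit)

# The model is linear on the logit scale ...
logit_hat <- b[["(Intercept)"]] + b[["distance"]] * distance
# ... and a sigmoid curve on the probability scale
p_hat <- plogis(logit_hat)

# The hand-computed probabilities match what predict() returns
all.equal(unname(p_hat), unname(predict(fit, type = "response")))
```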

Decision trees

Series of splits or if/then statements based on predictors
First the tree grows until some condition is met (maximum depth, no more data)
Then the tree is pruned to reduce its complexity
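
The grow-then-prune idea can be sketched directly with the rpart package, which is the default engine for decision_tree(). The kyphosis data ships with rpart; the control settings and the pruning threshold of 0.05 are arbitrary choices for illustration.

```r
library(rpart)

# Grow a deliberately large tree: cp = 0 turns off the complexity penalty
full_tree <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class",
  control = rpart.control(cp = 0, minsplit = 5)
)

# Prune it back with a cost-complexity threshold (0.05 is an arbitrary choice)
pruned_tree <- prune(full_tree, cp = 0.05)

# Number of nodes before and after pruning
nrow(full_tree$frame)
nrow(pruned_tree$frame)
```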

Decision trees

All models are wrong, but some are useful!

Logistic regression

Decision trees

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a `workflow()`?

Workflows handle new factor levels in new data better than base R tools

You can use other preprocessors besides formulas (more on feature engineering tomorrow!)

They can help organize your work when working with multiple models

Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
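
As a rough illustration of the last point, the sketch below puts a transformation inside the formula preprocessor and then predicts on new data that only contains the raw columns. The mtcars data, the factor labels, and the names lr_wflow, lr_fit, and new_cars are assumptions for this example only.

```r
library(tidymodels)

# Hypothetical example data: mtcars with a factor outcome
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# The formula preprocessor carries a transformation of `wt`
lr_wflow <-
  workflow() %>%
  add_formula(am ~ log(wt) + hp) %>%
  add_model(logistic_reg())

lr_fit <- fit(lr_wflow, data = cars)

# New data arrives with raw `wt`; predict() re-applies log() before the model
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(110, 150))
preds <- predict(lr_fit, new_data = new_cars, type = "prob")
preds
```

Because the preprocessing travels with the fitted workflow, there is no separate step to remember at prediction time.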

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

tree_spec %>% 
  fit(tip ~ ., data = taxi_train) 
#> parsnip model object
#> 
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

workflow() %>%
  add_formula(tip ~ .) %>%
  add_model(tree_spec) %>%
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> tip ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 7045 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 7045 2069 yes (0.70631654 0.29368346)  
#>    2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328  744 yes (0.82809612 0.17190388)  
#>      4) distance< 4.615 2365  254 yes (0.89260042 0.10739958) *
#>      5) distance>=4.615 1963  490 yes (0.75038207 0.24961793)  
#>       10) distance>=12.565 1069   81 yes (0.92422825 0.07577175) *
#>       11) distance< 12.565 894  409 yes (0.54250559 0.45749441)  
#>         22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278   71 yes (0.74460432 0.25539568) *
#>         23) company=City Service,other 616  278 no (0.45129870 0.54870130)  
#>           46) distance< 7.205 178   59 yes (0.66853933 0.33146067) *
#>           47) distance>=7.205 438  159 no (0.36301370 0.63698630) *
#>    3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022)  
#>      6) distance< 3.235 1331  391 yes (0.70623591 0.29376409) *
#>      7) distance>=3.235 1386  452 no (0.32611833 0.67388167)  
#>       14) distance>=12.39 344   90 yes (0.73837209 0.26162791) *
#>       15) distance< 12.39 1042  198 no (0.19001919 0.80998081) *

Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code to make a workflow with your own model of choice.

05:00

Predict with your model

How do you use your new tree_fit model?

tree_spec <-
  decision_tree() %>% 
  set_mode("classification")

tree_fit <-
  workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train)

Your turn

Run:

predict(tree_fit, new_data = taxi_test)

What do you get?

03:00

Your turn

Run:

augment(tree_fit, new_data = taxi_test)

What do you get?

03:00

The tidymodels prediction guarantee!

The predictions will always be inside a tibble
The column names and types are unsurprising and predictable
The number of rows in new_data and the output are the same
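
These three properties can be checked on a small fitted workflow. This is a sketch on mtcars rather than the taxi data; the object names and the choice of predictors are assumptions for illustration.

```r
library(tidymodels)

# Hypothetical example data: mtcars with a factor outcome
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

tree_wflow_fit <-
  workflow(am ~ wt + hp, decision_tree(mode = "classification")) %>%
  fit(data = cars)

new_cars <- cars[1:5, ]

class_preds <- predict(tree_wflow_fit, new_data = new_cars)
prob_preds  <- predict(tree_wflow_fit, new_data = new_cars, type = "prob")
augmented   <- augment(tree_wflow_fit, new_data = new_cars)

# Predictable column names: .pred_class, then .pred_{level} for probabilities
names(class_preds)
names(prob_preds)

# Same number of rows as new_data
nrow(class_preds) == nrow(new_cars)
```

augment() returns the original columns of new_data together with the prediction columns, which is convenient for computing metrics or plotting.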

Understand your model

How do you understand your new tree_fit model?

library(rpart.plot)
tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.
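
A brief sketch of a few of the extract_*() helpers, again on a small mtcars workflow rather than the taxi data (the data and the name wflow_fit are assumptions for illustration):

```r
library(tidymodels)

# Hypothetical example data: mtcars with a factor outcome
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

wflow_fit <-
  workflow(am ~ wt + hp, decision_tree(mode = "classification")) %>%
  fit(data = cars)

extract_preprocessor(wflow_fit)  # the formula
extract_spec_parsnip(wflow_fit)  # the model specification
extract_fit_parsnip(wflow_fit)   # the parsnip model fit
extract_fit_engine(wflow_fit)    # the underlying rpart object
```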

Understand your model

How do you understand your new tree_fit model?

You can use your fitted workflow for model and/or prediction explanations:

overall variable importance, such as with the vip package

flexible model explainers, such as with the DALEXtra package

Learn more at https://www.tmwr.org/explain.html

Your turn

Extract the model engine object from your fitted workflow.

⚠️ Never predict() with any extracted components!

05:00

Deploy your model

Deploying a model

How do you use your new tree_fit model in production?

library(vetiver)
v <- vetiver_model(tree_fit, "taxi")
v
#> 
#> ── taxi ─ <bundled_workflow> model for deployment 
#> A rpart classification modeling workflow using 6 features

Learn more at https://vetiver.rstudio.com

Deploy your model

How do you use your new model tree_fit in production?

library(plumber)
pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver
#> ├──/ping (GET)
#> └──/predict (POST)

Learn more at https://vetiver.rstudio.com

Your turn

Run the vetiver chunk in your .qmd.

Check out the automated visual documentation.

05:00
