3 - What makes a model?

Machine learning with tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

lm for linear model
glm for generalized linear model (e.g. logistic regression)
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets
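As a small illustration of how these interfaces overlap, the sketch below fits the same straight line with lm() and with glm() using a Gaussian family. The built-in mtcars data is only a stand-in chosen for this example; it is not the workshop data.

```r
# Two base R routes to the same ordinary least squares fit.
# mtcars is a stand-in data set (an assumption for this sketch).
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian())

# Both functions estimate the same intercept and slope,
# but they are called differently and return different object classes.
coef(fit_lm)
coef(fit_glm)
class(fit_lm)
class(fit_glm)
```

The differing arguments and return types are part of why a unified specification interface is convenient.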

To specify a model

Choose a model
Specify an engine
Set the mode

To specify a model

linear_reg()
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

To specify a model

linear_reg() %>%
  set_engine("glmnet")
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: glmnet

To specify a model

linear_reg() %>%
  set_engine("stan")
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: stan
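To see which engines are registered for a model type, parsnip provides show_engines(). A minimal sketch, assuming the tidymodels meta-package is installed; the exact list depends on which extension packages are loaded.

```r
library(tidymodels)

# List the engines currently registered for linear_reg().
# The result is a tibble with one row per engine/mode pair.
engines <- show_engines("linear_reg")
engines
```

Engines such as glmnet or stan still need their underlying packages installed before a model can be fit with them.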

To specify a model

decision_tree()
#> Decision Tree Model Specification (unknown)
#> 
#> Computational engine: rpart

To specify a model

decision_tree() %>% 
  set_mode("regression")
#> Decision Tree Model Specification (regression)
#> 
#> Computational engine: rpart

All available models are listed at https://www.tidymodels.org/find/parsnip/
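Putting the three steps together, a specification can state the model, engine, and mode explicitly instead of relying on defaults. The object names lm_spec and tree_spec_full below are chosen only for this sketch.

```r
library(tidymodels)

# Step 1: choose a model; step 2: specify an engine; step 3: set the mode.
lm_spec <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

tree_spec_full <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# A specification only describes the model; nothing has been fit yet.
lm_spec
tree_spec_full
```

Being explicit costs a few extra lines but makes the specification easier to read later, especially for models such as decision_tree() whose default mode is unknown.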

Your turn

Run the tree_spec chunk in your .qmd.

Edit this code so it creates a different model.

Models we’ll be using today

Linear regression
Decision trees

Linear regression

Outcome modeled as linear combination of predictors:

\(\mbox{latency} = \beta_0 + \beta_1\cdot\mbox{age} + \epsilon\)

Find a line that minimizes the mean squared error (MSE)
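The idea of minimizing MSE can be checked numerically. This sketch uses the built-in mtcars data as a stand-in (mpg as the outcome, wt as the single predictor), which is an assumption of the example rather than the workshop data.

```r
# Fit a one-predictor linear model by least squares.
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)

# Mean squared error of the line b0 + b1 * wt on the same data.
mse <- function(b0, b1) {
  mean((mtcars$mpg - (b0 + b1 * mtcars$wt))^2)
}

mse_fit     <- mse(b[1], b[2])        # MSE at the fitted coefficients
mse_shifted <- mse(b[1] + 1, b[2])    # same slope, intercept moved by 1

# Any other line has a larger MSE on the training data.
c(fitted = mse_fit, shifted = mse_shifted)
```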

Decision trees

Series of splits or if/then statements based on predictors
First the tree grows until some condition is met (maximum depth, no more data)
Then the tree is pruned to reduce its complexity
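The grow-then-prune idea can be sketched directly with the rpart package, which is the default engine for decision_tree(). The control values and the mtcars stand-in data are assumptions chosen for illustration, not settings from the workshop.

```r
library(rpart)

# Grow a deliberately large tree: cp = 0 disables the complexity penalty,
# so growth stops only at maxdepth or when nodes are too small to split.
big_tree <- rpart(
  mpg ~ ., data = mtcars, method = "anova",
  control = rpart.control(cp = 0, minsplit = 5, maxdepth = 10)
)

# Prune back: splits that do not improve the fit by at least cp are removed.
small_tree <- prune(big_tree, cp = 0.05)

# Count terminal nodes (leaves) in each tree.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(grown = n_leaves(big_tree), pruned = n_leaves(small_tree))
```

In parsnip, the corresponding arguments of decision_tree() are tree_depth, min_n, and cost_complexity.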

All models are wrong, but some are useful!

[Figure panels: Linear regression; Decision trees]

A model workflow

Workflows bind preprocessors and models

What is wrong with this?

Why a `workflow()`?

Workflows handle new factor levels in new data better than base R tools do

You can use other preprocessors besides formulas (more on feature engineering tomorrow!)

They can help organize your work when working with multiple models

Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
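As a sketch of the "multiple models" point, the same formula can be paired with several specifications and each pairing fit in one pass. The mtcars data and the names specs and fits are assumptions made for this example.

```r
library(tidymodels)

# Two model specifications that will share one preprocessor (a formula).
specs <- list(
  lm   = linear_reg(),
  tree = decision_tree() %>% set_mode("regression")
)

# Bundle each specification with the formula and fit it.
fits <- lapply(specs, function(spec) {
  workflow(mpg ~ wt + hp, spec) %>%
    fit(data = mtcars)
})

# Each fitted workflow predicts through the same interface.
preds <- lapply(fits, predict, new_data = head(mtcars))
preds$lm
preds$tree
```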

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("regression")

tree_spec %>% 
  fit(latency ~ ., data = frog_train) 
#> parsnip model object
#> 
#> n= 456 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 456 2197966.00  92.90351  
#>    2) age>=4.947975 256  252347.40  60.89844  
#>      4) treatment=control 131   91424.06  48.42748 *
#>      5) treatment=gentamicin 125  119197.90  73.96800 *
#>    3) age< 4.947975 200 1347741.00 133.87000  
#>      6) treatment=control 140  986790.70 118.25710  
#>       12) reflex=mid,full 129  754363.70 111.56590 *
#>       13) reflex=low 11  158918.20 196.72730 *
#>      7) treatment=gentamicin 60  247194.60 170.30000  
#>       14) age< 4.664439 30  102190.20 147.83330  
#>         28) age>=4.566638 22   53953.86 129.77270 *
#>         29) age< 4.566638 8   21326.00 197.50000 *
#>       15) age>=4.664439 30  114719.40 192.76670 *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("regression")

workflow() %>%
  add_formula(latency ~ .) %>%
  add_model(tree_spec) %>%
  fit(data = frog_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> latency ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 456 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 456 2197966.00  92.90351  
#>    2) age>=4.947975 256  252347.40  60.89844  
#>      4) treatment=control 131   91424.06  48.42748 *
#>      5) treatment=gentamicin 125  119197.90  73.96800 *
#>    3) age< 4.947975 200 1347741.00 133.87000  
#>      6) treatment=control 140  986790.70 118.25710  
#>       12) reflex=mid,full 129  754363.70 111.56590 *
#>       13) reflex=low 11  158918.20 196.72730 *
#>      7) treatment=gentamicin 60  247194.60 170.30000  
#>       14) age< 4.664439 30  102190.20 147.83330  
#>         28) age>=4.566638 22   53953.86 129.77270 *
#>         29) age< 4.566638 8   21326.00 197.50000 *
#>       15) age>=4.664439 30  114719.40 192.76670 *

A model workflow

tree_spec <-
  decision_tree() %>% 
  set_mode("regression")

workflow(latency ~ ., tree_spec) %>% 
  fit(data = frog_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> latency ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> n= 456 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 456 2197966.00  92.90351  
#>    2) age>=4.947975 256  252347.40  60.89844  
#>      4) treatment=control 131   91424.06  48.42748 *
#>      5) treatment=gentamicin 125  119197.90  73.96800 *
#>    3) age< 4.947975 200 1347741.00 133.87000  
#>      6) treatment=control 140  986790.70 118.25710  
#>       12) reflex=mid,full 129  754363.70 111.56590 *
#>       13) reflex=low 11  158918.20 196.72730 *
#>      7) treatment=gentamicin 60  247194.60 170.30000  
#>       14) age< 4.664439 30  102190.20 147.83330  
#>         28) age>=4.566638 22   53953.86 129.77270 *
#>         29) age< 4.566638 8   21326.00 197.50000 *
#>       15) age>=4.664439 30  114719.40 192.76670 *

Your turn

Run the tree_wflow chunk in your .qmd.

Edit this code so it uses a linear model.

Predict with your model

How do you use your new tree_fit model?

tree_spec <-
  decision_tree() %>% 
  set_mode("regression")

tree_fit <-
  workflow(latency ~ ., tree_spec) %>% 
  fit(data = frog_train)

Your turn

Run:

predict(tree_fit, new_data = frog_test)

What do you get?

Your turn

Run:

augment(tree_fit, new_data = frog_test)

What do you get?

The tidymodels prediction guarantee!

The predictions will always be inside a tibble
The column names and types are unsurprising and predictable
The number of rows in new_data and the output are the same
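These three guarantees can be checked in code. The sketch below uses mtcars as a stand-in for the workshop data; the names car_split, car_train, car_test, and car_fit are hypothetical and exist only for this example.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

car_fit <-
  workflow(mpg ~ ., decision_tree() %>% set_mode("regression")) %>%
  fit(data = car_train)

preds <- predict(car_fit, new_data = car_test)
aug   <- augment(car_fit, new_data = car_test)

tibble::is_tibble(preds)            # predictions come back in a tibble
names(preds)                        # a single, predictably named column
nrow(preds) == nrow(car_test)       # one row per row of new_data
```

augment() returns the original columns of new_data with the prediction column added, so it has the same number of rows as well.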

Understand your model

How do you understand your new tree_fit model?

library(rpart.plot)
tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.
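A short sketch of several extract_*() helpers applied to one fitted workflow. The mtcars data and the object name demo_fit are assumptions for this example.

```r
library(tidymodels)

demo_fit <-
  workflow(mpg ~ wt + hp, decision_tree() %>% set_mode("regression")) %>%
  fit(data = mtcars)

engine_fit  <- extract_fit_engine(demo_fit)    # the underlying rpart object
parsnip_fit <- extract_fit_parsnip(demo_fit)   # the parsnip model fit
preproc     <- extract_preprocessor(demo_fit)  # the formula used for preprocessing
spec        <- extract_spec_parsnip(demo_fit)  # the original model specification

class(engine_fit)
class(parsnip_fit)
preproc
```

The extracted objects are meant for inspection and plotting rather than for prediction.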

You can use your fitted workflow for model and/or prediction explanations:

overall variable importance, such as with the vip package

flexible model explainers, such as with the DALEXtra package

Learn more at https://www.tmwr.org/explain.html
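As a minimal sketch of overall variable importance, the vip package can compute model-based importance scores from the extracted engine fit. This assumes vip is installed, and it uses mtcars and the name vip_fit as stand-ins for the workshop objects.

```r
library(tidymodels)
library(vip)

vip_fit <-
  workflow(mpg ~ ., decision_tree() %>% set_mode("regression")) %>%
  fit(data = mtcars)

# vi() returns a table of importance scores, one row per variable.
importance <-
  vip_fit %>%
  extract_fit_engine() %>%
  vi()

importance
```

Calling vip() in place of vi() draws the same scores as a bar chart.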

Your turn

Extract the model engine object from your fitted linear workflow.

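⚠️ Never predict() with any extracted components! Extracted objects bypass the workflow's preprocessing, so predictions can fail or be silently wrong.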

Deploy your model

Deploying a model

How do you use your new tree_fit model in production?

library(vetiver)
v <- vetiver_model(tree_fit, "frog_hatching")
v
#> 
#> ── frog_hatching ─ <butchered_workflow> model for deployment 
#> A rpart regression modeling workflow using 4 features

Learn more at https://vetiver.rstudio.com

Deploy your model

How do you use your new model tree_fit in production?

library(plumber)
pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver
#> ├──/ping (GET)
#> └──/predict (POST)

Learn more at https://vetiver.rstudio.com

Your turn

Run the vetiver chunk in your .qmd.

Check out the automated visual documentation.
