1 - Introduction

Machine learning with tidymodels


Wi-Fi network name


Wi-Fi password


Workshop policies

  • Please do not photograph people wearing red lanyards

  • There are gender neutral bathrooms near National Harbor rooms

  • A meditation room is located at National Harbor 9 (8am - 5pm, Mon - Thurs)

  • A lactation room is located at Potomac Dressing Room (8am - 5pm, Mon - Thurs)

Who are you?

  • You can use the magrittr %>% or base R |> pipe

  • You are familiar with functions from dplyr, tidyr, ggplot2

  • You have exposure to basic statistical concepts

  • You do not need intermediate or expert familiarity with modeling or ML

Who are we?

  • Simon Couch
  • Hannah Frick
  • Emil Hvitfeldt
  • Max Kuhn
  • Julia Silge
  • David Robinson
  • Davis Vaughan

Who are we?

  • Kelly Bodwin
  • Michael Chow
  • Pritam Dalal
  • Matt Dancho
  • Jon Harmon
  • Mike Mahoney
  • Edgar Ruiz
  • Asmae Toumi
  • Qiushi Yan

Many thanks to Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!

Asking for help

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”


Plan for this workshop

  • Today:

    • Your data budget
    • What makes a model
    • Evaluating models
  • Tomorrow:

    • Feature engineering
    • Tuning hyperparameters
    • Wrapping up!

Introduce yourself to your neighbors 👋

Log in to RStudio Cloud here (free):


What is machine learning?

Your turn

How are statistics and machine learning related?

How are they similar? Different?


What is tidymodels?

#> ── Attaching packages ──────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom        1.0.0     ✔ rsample      1.0.0
#> ✔ dials        1.0.0     ✔ tibble       3.1.8
#> ✔ dplyr        1.0.9     ✔ tidyr        1.2.0
#> ✔ infer        1.0.2     ✔ tune         1.0.0
#> ✔ modeldata    1.0.0     ✔ workflows    1.0.0
#> ✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
#> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
#> ✔ recipes      1.0.1
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

The whole game

  • Tomorrow we will walk through a case study in detail to illustrate feature engineering and model tuning.

  • Today we will walk through the analysis at a higher level to show the model development process as a whole and give you an introduction to the data set.

  • The data are from the NHL, where we want to predict whether a shot was on goal or not! 🏒

  • It’s a good example to show how model development works.

Shots on goal

Data spending
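
A minimal sketch of what spending data can look like with rsample, assuming a small simulated data frame. The name `toy_shots`, its columns, and the 75/25 proportion are hypothetical stand-ins, not the workshop's NHL data or its actual split.

```r
library(tidymodels)

set.seed(123)
# Hypothetical stand-in for the shot data: 200 rows, binary outcome
toy_shots <- tibble(
  on_goal = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  coord_x = runif(200, 25, 100),
  coord_y = runif(200, -42, 42)
)

# Hold out roughly 25% of the rows for testing, stratified on the outcome
shot_split <- initial_split(toy_shots, prop = 0.75, strata = on_goal)
shot_train <- training(shot_split)
shot_test  <- testing(shot_split)
```

The training set is what gets used for model development; the test set is set aside until the very end.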

A first model

Starting point: logistic regression

  • We’ll start by using basic logistic regression to predict our binary outcome.

  • Our first model will have 16 simple predictor columns.

  • One initial question: there are 640 players taking shots.

  • For logistic regression, do we convert these to binary indicators (a.k.a. “dummies”)?
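
One way to sketch the indicator-variable approach, assuming simulated data with 4 players instead of 640. The data frame `toy_shots` and its columns are hypothetical; the recipe, model, and workflow functions come from tidymodels.

```r
library(tidymodels)

set.seed(1)
# Hypothetical toy data: a binary outcome, a player factor, one numeric predictor
toy_shots <- tibble(
  on_goal = factor(sample(c("yes", "no"), 300, replace = TRUE)),
  player  = factor(sample(paste0("player_", 1:4), 300, replace = TRUE)),
  coord_x = runif(300, 25, 100)
)

# Convert the player factor into binary indicator columns
dummy_rec <- recipe(on_goal ~ ., data = toy_shots) |>
  step_dummy(all_nominal_predictors())

# Inspect the processed data: 4 factor levels become 3 indicator columns
baked <- dummy_rec |> prep() |> bake(new_data = NULL)

# Bundle the recipe with a logistic regression and fit both together
glm_fit <- workflow() |>
  add_recipe(dummy_rec) |>
  add_model(logistic_reg()) |>
  fit(data = toy_shots)
```

With 640 players the same recipe would create 639 indicator columns, which is why the encoding question matters.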

Basic features (incl. dummy variables)

Different player encoding

What about location?

The previous models used the x/y coordinates.

Are there better ways to represent shot location?

How can we make location more usable for the model?

Add shot angle?

Add shot distance?

Add shot behind goal line?
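
The three candidate features can be sketched from x/y coordinates as follows. This assumes coordinates in feet with center ice at (0, 0) and the attacking net on the goal line at x = 89, y = 0; the column names `coord_x` and `coord_y` and the example shots are hypothetical.

```r
library(dplyr)

# Assumed x position of the goal line (89 ft from center ice on an NHL rink)
goal_x <- 89

# Three made-up shots: straight on, from the side, and from behind the net
shots <- tibble(
  coord_x = c(60, 80, 95),
  coord_y = c(0, 20, -5)
)

shots <- shots |>
  mutate(
    # Straight-line distance from the shot to the center of the net
    distance = sqrt((goal_x - coord_x)^2 + coord_y^2),
    # Angle off the center line in degrees (0 = straight on;
    # values above 90 correspond to shots from behind the goal line)
    angle = atan2(abs(coord_y), goal_x - coord_x) * 180 / pi,
    # Indicator for shots taken from behind the goal line
    behind_goal_line = coord_x > goal_x
  )
```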

Nonlinear terms for angle and distance
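
A sketch of one way to add nonlinear terms, assuming natural splines via `step_ns()` from recipes. The choice of splines, the 3 degrees of freedom, and the simulated `toy_shots` data are illustrative assumptions, not necessarily what the workshop uses.

```r
library(tidymodels)

set.seed(2)
# Hypothetical data with angle and distance already computed
toy_shots <- tibble(
  on_goal  = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  angle    = runif(200, 0, 90),
  distance = runif(200, 5, 90)
)

# Replace each predictor with natural spline basis columns
spline_rec <- recipe(on_goal ~ angle + distance, data = toy_shots) |>
  step_ns(angle, distance, deg_free = 3)

# The processed data has 3 basis columns per predictor plus the outcome
spline_cols <- spline_rec |> prep() |> bake(new_data = NULL)
```

A logistic regression fit on these columns can bend with angle and distance instead of being limited to straight-line effects.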

Try another model

Switch to boosting and basic features
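
A minimal sketch of swapping in a boosted tree model with parsnip, assuming the xgboost engine is installed. The simulated `toy_shots` data and the values for `trees` and `learn_rate` are hypothetical placeholders.

```r
library(tidymodels)

set.seed(3)
# Hypothetical data with two numeric predictors
toy_shots <- tibble(
  on_goal = factor(sample(c("yes", "no"), 300, replace = TRUE)),
  coord_x = runif(300, 25, 100),
  coord_y = runif(300, -42, 42)
)

# Model specification: only this part changes when switching model types
xgb_spec <- boost_tree(trees = 50, learn_rate = 0.1) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_fit <- workflow() |>
  add_formula(on_goal ~ .) |>
  add_model(xgb_spec) |>
  fit(data = toy_shots)

# Class probabilities for the first few rows
xgb_preds <- predict(xgb_fit, new_data = head(toy_shots), type = "prob")
```

The workflow code is the same shape as for logistic regression; the model specification is the only piece that differs.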

Boosting with location features

Choose wisely…

Finalize and verify

… and so on

Once we find an acceptable model and feature set, the process is to

  • Confirm our results on the test set.
  • Document the data and model development process.
  • Deploy, monitor, etc.
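
The first of these steps can be sketched with `last_fit()` from tune, which fits on the training set and evaluates once on the test set. The simulated data and the plain logistic regression workflow below are hypothetical stand-ins for whatever model is finally chosen.

```r
library(tidymodels)

set.seed(4)
# Hypothetical data and split
toy_shots <- tibble(
  on_goal = factor(sample(c("yes", "no"), 300, replace = TRUE)),
  coord_x = runif(300, 25, 100),
  coord_y = runif(300, -42, 42)
)
shot_split <- initial_split(toy_shots, strata = on_goal)

# Stand-in for the chosen model and feature set
final_wflow <- workflow() |>
  add_formula(on_goal ~ .) |>
  add_model(logistic_reg())

# Fit on the training set, then compute metrics on the test set
final_res <- last_fit(final_wflow, shot_split)
test_metrics <- collect_metrics(final_res)
```

Because the test set is used only here, the resulting metrics are an honest check on the results seen during development.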

Let’s install some packages

If you are using your own laptop instead of RStudio Cloud:

install.packages(c("DALEXtra", "doParallel", "embed", "forcats",
                   "lme4", "ranger", "remotes", "rpart", 
                   "rpart.plot", "stacks", "tidymodels",
                   "vetiver", "xgboost"))


Or log in to RStudio Cloud:


Our versions

broom (1.0.0, CRAN), DALEX (2.4.0, CRAN), DALEXtra (2.2.0, CRAN), dials (1.0.0, CRAN), doParallel (1.0.17, CRAN), dplyr (1.0.9, CRAN), embed (1.0.0, CRAN), ggplot2 (3.3.6, CRAN), modeldata (1.0.0, CRAN), ongoal (0.0.2, GitHub (topepo/ongoal@02cd6b233)), parsnip (1.0.0, CRAN), purrr (0.3.4, CRAN), ranger (0.13.1, CRAN), recipes (1.0.1, local), rpart (4.1.16, CRAN), rpart.plot (3.1.1, CRAN), rsample (1.0.0, CRAN), scales (1.2.0, CRAN), stacks (0.2.3, CRAN), tibble (3.1.8, CRAN), tidymodels (1.0.0, CRAN), tidyr (1.2.0, CRAN), tune (1.0.0, CRAN), vetiver (0.1.5, CRAN), workflows (1.0.0, CRAN), workflowsets (1.0.0, CRAN), xgboost (, CRAN), and yardstick (1.0.0, CRAN)