Machine learning with tidymodels
Welcome!
Wi-Fi network name: conf22
Wi-Fi password: together!
Please do not photograph people wearing red lanyards
There are gender neutral bathrooms near National Harbor rooms
A meditation room is located at National Harbor 9 (8am - 5pm, Mon - Thurs)
A lactation room is located at Potomac Dressing Room (8am - 5pm, Mon - Thurs)
Please review the rstudio::conf code of conduct, which applies to all workshops: https://www.rstudio.com/conference/2022/2022-conf-code-of-conduct/
CoC site has info on how to report a problem (in person, email, phone)
You are required to wear a mask that fully covers your mouth and nose at all times in all public spaces
You can use the magrittr %>% or base R |> pipe
You are familiar with functions from dplyr, tidyr, ggplot2
You have exposure to basic statistical concepts
You do not need intermediate or expert familiarity with modeling or ML
Many thanks to Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!
🟪 "I'm stuck and need help!"
🟩 "I finished the exercise"
Today:
Tomorrow:
Illustration credit: https://vas3k.com/blog/machine_learning/
How are statistics and machine learning related?
How are they similar? Different?
library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom        1.0.0     ✔ rsample      1.0.0
#> ✔ dials        1.0.0     ✔ tibble       3.1.8
#> ✔ dplyr        1.0.9     ✔ tidyr        1.2.0
#> ✔ infer        1.0.2     ✔ tune         1.0.0
#> ✔ modeldata    1.0.0     ✔ workflows    1.0.0
#> ✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
#> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
#> ✔ recipes      1.0.1
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/
Tomorrow we will walk through a case study in detail to illustrate feature engineering and model tuning.
Today we will walk through the analysis at a higher level to show the model development process as a whole and give you an introduction to the data set.
The data are from the NHL, where we want to predict whether a shot was on goal or not!
It's a good example to show how model development works.
We'll start by using basic logistic regression to predict our binary outcome.
Our first model will have 16 simple predictor columns.
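As a rough illustration of what "basic logistic regression on a binary outcome" means, here is a minimal base R sketch. It does not use the workshop's NHL data or tidymodels code: the data are simulated, and the column names `distance`, `angle`, and `on_goal` are hypothetical stand-ins for the real predictors.

```r
# Simulated stand-in data; none of these columns come from the workshop data set.
set.seed(1)
n <- 500
shots <- data.frame(
  distance = runif(n, 5, 60),   # hypothetical predictor
  angle    = runif(n, -80, 80)  # hypothetical predictor
)

# Simulate a binary outcome where closer shots are more likely to be on goal.
p_true <- plogis(1.5 - 0.05 * shots$distance)
shots$on_goal <- factor(
  ifelse(runif(n) < p_true, "yes", "no"),
  levels = c("no", "yes")       # glm() models the probability of the second level
)

# Fit the logistic regression and predict class probabilities.
fit   <- glm(on_goal ~ distance + angle, data = shots, family = binomial())
probs <- predict(fit, newdata = shots, type = "response")

# Turn probabilities into hard class predictions at a 0.5 cutoff.
pred     <- ifelse(probs > 0.5, "yes", "no")
accuracy <- mean(pred == shots$on_goal)
```

Accuracy here is computed on the same rows used for fitting, so it is optimistic; it is only meant to show the mechanics.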
One initial question: there are 640 players taking shots.
For logistic regression, do we convert these to binary indicators (a.k.a. "dummies")?
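To make the question concrete, the sketch below shows what converting a categorical column to binary indicators produces, using base R's `model.matrix()`. The three player labels are made up for illustration. It only shows the mechanics of the conversion; it does not say whether this is the right choice for the model.

```r
# Toy data with a hypothetical player column (three players, five shots).
shots <- data.frame(
  player = factor(c("player_a", "player_b", "player_c", "player_a", "player_b"))
)

# model.matrix() expands the factor into an intercept plus one 0/1 column
# for every level except the first (the reference level).
dummies <- model.matrix(~ player, data = shots)

n_indicator_cols <- ncol(dummies) - 1  # exclude the intercept column
```

With this encoding, a factor with k levels becomes k - 1 indicator columns, so 640 players would add 639 columns to the model.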
The previous models used the x/y coordinates.
Are there better ways to represent shot location?
How can we make location more usable for the model?
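One possible answer, sketched here as an assumption rather than as the workshop's own approach, is to re-express the x/y coordinates as a distance and an angle relative to the net. The column names `coord_x` and `coord_y` and the goal position `(89, 0)` are hypothetical values chosen for the example.

```r
# Toy shot coordinates; column names and values are hypothetical.
shots <- data.frame(
  coord_x = c(60, 75, 85),
  coord_y = c(0, -20, 10)
)

# Assumed location of the center of the net.
goal_x <- 89
goal_y <- 0

# Offsets from each shot to the net.
dx <- goal_x - shots$coord_x
dy <- shots$coord_y - goal_y

# Straight-line distance to the net.
shots$distance <- sqrt(dx^2 + dy^2)

# Angle in degrees; 0 means the shot was taken from directly in front of the net.
shots$angle <- atan2(dy, dx) * 180 / pi
```

A shot from `(60, 0)` gets a distance of 29 and an angle of 0, which may be easier for a linear model to use than the raw coordinates.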
Once we find an acceptable model and feature set, the process is to
If you are using your own laptop instead of RStudio Cloud:
broom (1.0.0, CRAN), DALEX (2.4.0, CRAN), DALEXtra (2.2.0, CRAN), dials (1.0.0, CRAN), doParallel (1.0.17, CRAN), dplyr (1.0.9, CRAN), embed (1.0.0, CRAN), ggplot2 (3.3.6, CRAN), modeldata (1.0.0, CRAN), ongoal (0.0.2, GitHub (topepo/ongoal@02cd6b233)), parsnip (1.0.0, CRAN), purrr (0.3.4, CRAN), ranger (0.13.1, CRAN), recipes (1.0.1, local), rpart (4.1.16, CRAN), rpart.plot (3.1.1, CRAN), rsample (1.0.0, CRAN), scales (1.2.0, CRAN), stacks (0.2.3, CRAN), tibble (3.1.8, CRAN), tidymodels (1.0.0, CRAN), tidyr (1.2.0, CRAN), tune (1.0.0, CRAN), vetiver (0.1.5, CRAN), workflows (1.0.0, CRAN), workflowsets (1.0.0, CRAN), xgboost (1.6.0.1, CRAN), and yardstick (1.0.0, CRAN)
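A quick way to see which of the packages listed above are missing from a local library is sketched below. It only checks for presence and does not compare versions. The install commands are left commented out and are a suggestion, not the workshop's official setup instructions; they assume the `remotes` package is available for the GitHub-hosted ongoal package.

```r
# Package names taken from the list above.
pkgs <- c(
  "broom", "DALEX", "DALEXtra", "dials", "doParallel", "dplyr", "embed",
  "ggplot2", "modeldata", "ongoal", "parsnip", "purrr", "ranger", "recipes",
  "rpart", "rpart.plot", "rsample", "scales", "stacks", "tibble",
  "tidymodels", "tidyr", "tune", "vetiver", "workflows", "workflowsets",
  "xgboost", "yardstick"
)

# Compare against the packages already installed in the local library.
installed    <- pkgs %in% rownames(installed.packages())
missing_pkgs <- pkgs[!installed]

# Uncomment to install what is missing (requires network access):
# install.packages(setdiff(missing_pkgs, "ongoal"))
# remotes::install_github("topepo/ongoal")
```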