1 - Introduction

Machine learning with tidymodels


Who are you?

  • You can use the magrittr %>% or base R |> pipe

  • You are familiar with functions from dplyr, tidyr, ggplot2

  • You have exposure to basic statistical concepts

  • You do not need intermediate or expert familiarity with modeling or ML

Who are tidymodels?

  • Simon Couch
  • Hannah Frick
  • Emil Hvitfeldt
  • Max Kuhn

Many thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!

Asking for help

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”


Tentative plan for this workshop

  • Today:

    • Your data budget
    • What makes a model
    • Evaluating models
  • Tomorrow:

    • Feature engineering
    • Tuning hyperparameters
    • Racing methods
    • Iterative search methods

Introduce yourself to your neighbors 👋

Check Slack (#ml-ws-2023) for an RStudio Cloud link.

What is machine learning?

Your turn

How are statistics and machine learning related?

How are they similar? Different?


What is tidymodels?

#> ── Attaching packages ──────────────────────────── tidymodels 1.1.0 ──
#> ✔ broom        1.0.5          ✔ rsample
#> ✔ dials        1.2.0          ✔ tibble       3.2.1
#> ✔ dplyr        1.1.2          ✔ tidyr        1.3.0
#> ✔ infer        1.0.4          ✔ tune
#> ✔ modeldata    1.1.0          ✔ workflows    1.1.3
#> ✔ parsnip     ✔ workflowsets 1.0.1
#> ✔ purrr        1.0.1          ✔ yardstick
#> ✔ recipes      1.0.6
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use tidymodels_prefer() to resolve common conflicts.

The whole game

Any modelling process includes:

  • Splitting your data into training and test sets
  • Using a resampling scheme
  • Fitting models
  • Assessing performance
  • Choosing a model
  • Fitting and assessing the final model
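
The steps above can be sketched in a few lines of tidymodels code. This is only a rough illustration, not the workshop's own example: the built-in mtcars data, the two formulas, and all object names (car_split, small_wflow, and so on) are assumptions chosen to keep the sketch self-contained.

```r
library(tidymodels)

set.seed(123)

# 1. Split the data into training and test sets (mtcars is a stand-in dataset)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# 2. Use a resampling scheme on the training set only
car_folds <- vfold_cv(car_train, v = 5)

# 3. Fit models: one linear regression spec, two candidate formulas
lm_spec <- linear_reg() |> set_engine("lm")

small_wflow <- workflow() |>
  add_formula(mpg ~ wt) |>
  add_model(lm_spec)

large_wflow <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(lm_spec)

# 4. Assess performance across the resamples
small_res <- fit_resamples(small_wflow, resamples = car_folds)
large_res <- fit_resamples(large_wflow, resamples = car_folds)

# 5. Choose a model by comparing the resampled metrics
resampled_metrics <- bind_rows(
  small = collect_metrics(small_res),
  large = collect_metrics(large_res),
  .id = "model"
)
resampled_metrics

# 6. Fit the chosen model on the training set and assess it once on the test set
#    (large_wflow is picked here purely for illustration)
final_res <- last_fit(large_wflow, split = car_split)
collect_metrics(final_res)
```

Note that the test set is touched only in the last step; every comparison between candidate models uses the resamples of the training set.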

Let’s install some packages

If you are using your own laptop instead of RStudio Cloud:


pkgs <- c("bonsai", "doParallel", "embed", "finetune", "lightgbm", "lme4", 
          "parallelly", "plumber", "probably", "ranger", "rpart", "rpart.plot", 
          "stacks", "textrecipes", "tidymodels", "tidymodels/modeldatatoo", 

Our versions

