1 - Introduction

Advanced tidymodels


Venue information

  • There are gender neutral bathrooms located on levels 3, 4, 5, 6 & 7

  • A meditation/prayer room is located in 503
    (Mon & Tue 7am - 7pm, and Wed 7am - 5pm)

  • A lactation room is located in 509
    (Mon & Tue 7am - 7pm, and Wed 7am - 5pm)

Workshop policies

  • Please review the posit::conf code of conduct, which applies to all workshops: https://posit.co/code-of-conduct

  • CoC site has info on how to report a problem (in person, email, phone)

  • Please do not photograph people wearing red lanyards

Who are you?

  • You can use the magrittr %>% or base R |> pipe

  • You are familiar with functions from dplyr, tidyr, ggplot2

  • You have exposure to basic statistical concepts

  • You do not need intermediate or expert familiarity with modeling or ML

  • You have used some tidymodels packages

  • You have some experience with evaluating statistical models using resampling techniques

Who are tidymodels?

  • Simon Couch
  • Hannah Frick
  • Emil Hvitfeldt
  • Max Kuhn

Many thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!

Asking for help

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”


Tentative plan for this workshop

  • Feature engineering with recipes
  • Model optimization by tuning
    • Grid search
    • Racing
    • Iterative methods
  • Extras (time permitting)
    • Effect encodings
    • A case study

Introduce yourself to your neighbors 👋

Let’s install some packages

If you are using your own laptop instead of Posit Cloud:

# Install the packages for the workshop
pkgs <- 
  c("bonsai", "Cubist", "doParallel", "earth", "embed", "finetune", 
    "forested", "lightgbm", "lme4", "parallelly", "plumber", "probably", 
    "ranger", "rpart", "rpart.plot", "rules", "splines2", "stacks", 
    "text2vec", "textrecipes", "tidymodels", "vetiver")

install.packages(pkgs)


Also, you should install the newest version of the dials package (version 1.3.0). To check this, you can run:

rlang::check_installed("dials", version = "1.3.0")
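If some of the packages are already on your machine, it can be quicker to install only the missing ones. The following is a minimal sketch using base R only; `missing_pkgs()` and `has_min_version()` are hypothetical helper names made up for this illustration, and the short `pkgs` vector is a subset chosen for brevity, not the full workshop list.

```r
# Illustrative subset of the workshop packages (not the full list)
pkgs <- c("tidymodels", "finetune", "embed")

# Hypothetical helper: which of `pkgs` are not installed yet?
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: is `pkg` installed at `version` or newer?
# Returns FALSE when the package is not installed at all.
has_min_version <- function(pkg, version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= version
}

to_install <- missing_pkgs(pkgs)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}

# TRUE only if dials is installed and at least version 1.3.0
has_min_version("dials", "1.3.0")
```

The `install.packages()` call is left commented out so that running the sketch only reports what is missing.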

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package. We’ll sample down the data and re-create some factor columns:


library(tidymodels)

# Max's usual settings: 
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

data(hotel_rates)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )
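One likely reason for passing the columns through `factor(as.character(...))` is that subsetting a factor keeps all of its original levels, including ones that no longer occur in the sampled rows. The toy example below (made-up country codes, not the hotel data) sketches that behavior with base R:

```r
# Toy factor with three levels (values are made up for illustration)
x <- factor(c("FRA", "PRT", "GBR", "PRT"))

# Subsetting removes the "GBR" rows but keeps "GBR" as a level
x_sub <- x[x != "GBR"]
levels(x_sub)
#> [1] "FRA" "GBR" "PRT"

# Converting to character and back rebuilds the levels from the data
x_new <- factor(as.character(x_sub))
levels(x_new)
#> [1] "FRA" "PRT"
```

Base R's `droplevels()` serves a similar purpose. Note that rebuilding a factor this way sorts its levels alphabetically, which only matters if the original level order was meaningful.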

Hotel data columns

names(hotel_rates)
#>  [1] "avg_price_per_room"             "lead_time"                     
#>  [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
#>  [5] "adults"                         "children"                      
#>  [7] "babies"                         "meal"                          
#>  [9] "country"                        "market_segment"                
#> [11] "distribution_channel"           "is_repeated_guest"             
#> [13] "previous_cancellations"         "previous_bookings_not_canceled"
#> [15] "reserved_room_type"             "assigned_room_type"            
#> [17] "booking_changes"                "agent"                         
#> [19] "company"                        "days_in_waiting_list"          
#> [21] "customer_type"                  "required_car_parking_spaces"   
#> [23] "total_of_special_requests"      "arrival_date_num"              
#> [25] "near_christmas"                 "near_new_years"                
#> [27] "historical_adr"

Data splitting strategy

Data Spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
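To see what a stratified split does, it can help to check the resulting proportions and compare the outcome distribution in each partition. The sketch below uses simulated data rather than the hotel data; the `toy` data frame, its `price` column, and the seed value are all assumptions made for illustration. It relies on `initial_split()` defaulting to `prop = 3/4` and binning a numeric `strata` variable into quantile groups.

```r
library(rsample)

# Simulated, right-skewed outcome standing in for a price column
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
toy <- data.frame(
  price = rlnorm(1000, meanlog = 4.5, sdlog = 0.5),
  x = rnorm(1000)
)

# Stratified split; prop = 3/4 is the default, written out for clarity
toy_split <- initial_split(toy, prop = 3/4, strata = price)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of rows that landed in the training set (should be close to 0.75)
nrow(toy_train) / nrow(toy)

# With stratification, the outcome quantiles should look similar in both sets
summary(toy_train$price)
summary(toy_test$price)
```

Setting a seed before `initial_split()` is what makes the partition repeatable from run to run.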

Your turn

Let’s take some time and investigate the training data. The outcome is avg_price_per_room.

Are there any interesting characteristics of the data?


Our versions

R version 4.4.1 (2024-06-14), Quarto (1.6.1)

