1 - Introduction

Advanced tidymodels

Welcome!

Wi-Fi network name

TODO-ADD-LATER

Wi-Fi password

TODO-ADD-LATER

Venue information

  • There are gender neutral bathrooms located on levels 3, 4, 5, 6 & 7

  • A meditation/prayer room is located in 503
    (Mon & Tue 7am - 7pm, and Wed 7am - 5pm)

  • A lactation room is located in 509
    (Mon & Tue 7am - 7pm, and Wed 7am - 5pm)

Workshop policies

  • Please review the posit::conf code of conduct, which applies to all workshops: https://posit.co/code-of-conduct

  • The code of conduct site explains how to report a problem (in person, by email, or by phone)

  • Please do not photograph people wearing red lanyards

Who are you?

  • You can use the magrittr %>% or base R |> pipe

  • You are familiar with functions from dplyr, tidyr, ggplot2

  • You have exposure to basic statistical concepts

  • You do not need intermediate or expert familiarity with modeling or ML

  • You have used some tidymodels packages

  • You have some experience with evaluating statistical models using resampling techniques

Who are tidymodels?

  • Simon Couch
  • Hannah Frick
  • Emil Hvitfeldt
  • Max Kuhn

Many thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!

Asking for help

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”

Discord

  • pos.it/conf-event-portal (login)
  • Click on “Join Discord, the virtual networking platform!”
  • Browse Channels -> #workshop-tidymodels-advanced

Tentative plan for this workshop

  • Feature engineering with recipes
  • Model optimization by tuning
    • Grid search
    • Racing
    • Iterative methods
  • Extras (time permitting)
    • Effect encodings
    • A case study

Introduce yourself to your neighbors 👋



Log in to Posit Cloud (free): TODO-ADD-LATER

Let’s install some packages

If you are using your own laptop instead of Posit Cloud:

# Install the packages for the workshop
pkgs <- 
  c("bonsai", "Cubist", "doParallel", "earth", "embed", "finetune", 
    "forested", "lightgbm", "lme4", "parallelly", "plumber", "probably", 
    "ranger", "rpart", "rpart.plot", "rules", "splines2", "stacks", 
    "text2vec", "textrecipes", "tidymodels", "vetiver")

install.packages(pkgs)

Also, you should install the newest version of the dials package (version 1.3.0). To check this, you can run:

rlang::check_installed("dials", version = "1.3.0")
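If you would rather get a plain TRUE/FALSE answer than an interactive prompt, the same check can be sketched in base R. The helper name `has_min_version()` is hypothetical and not part of the workshop materials or of any package:

```r
# Hypothetical helper (an illustration, not part of the workshop materials):
# TRUE if `pkg` is installed and its version is at least `min_version`.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# FALSE means dials is either missing or older than 1.3.0
has_min_version("dials", "1.3.0")
```

Unlike `rlang::check_installed()`, this sketch never offers to install anything; it only reports the current state.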

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package. We’ll sample down the data and re-create some of the factor columns:

library(tidymodels)

# Max's usual settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

# Load the hotel data and fix the seed so the 5,000-row sample is reproducible
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )
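One effect of the `factor(as.character(...))` pattern above is that factor levels no longer present after sampling are dropped. Whether that is the main motivation here is an assumption on our part, but the behavior itself can be seen with a toy factor in base R:

```r
# Toy factor with four levels (illustrative values, not the hotel data)
x <- factor(c("a", "b", "c", "d"))

# Subsetting keeps the full set of levels, even the ones no longer observed
x_sub <- x[1:2]
levels(x_sub)

# Converting to character and back keeps only the levels that are present
x_new <- factor(as.character(x_sub))
levels(x_new)
```

Base R's `droplevels()` has a similar effect on unused levels.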

Hotel data columns

names(hotel_rates)
#>  [1] "avg_price_per_room"             "lead_time"                     
#>  [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
#>  [5] "adults"                         "children"                      
#>  [7] "babies"                         "meal"                          
#>  [9] "country"                        "market_segment"                
#> [11] "distribution_channel"           "is_repeated_guest"             
#> [13] "previous_cancellations"         "previous_bookings_not_canceled"
#> [15] "reserved_room_type"             "assigned_room_type"            
#> [17] "booking_changes"                "agent"                         
#> [19] "company"                        "days_in_waiting_list"          
#> [21] "customer_type"                  "required_car_parking_spaces"   
#> [23] "total_of_special_requests"      "arrival_date_num"              
#> [25] "near_christmas"                 "near_new_years"                
#> [27] "historical_adr"

Data splitting strategy

Data Spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
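To build intuition for what stratifying on a numeric outcome does, here is a rough base R sketch: bin the outcome into quartiles, then sample 75% of the rows within each bin. This is an illustration of the idea under simplifying assumptions, not rsample's actual implementation, and the `toy` data frame is simulated rather than taken from the hotel data:

```r
set.seed(4028)

# Simulated, right-skewed outcome standing in for avg_price_per_room
toy <- data.frame(price = rexp(1000, rate = 1 / 100))

# Bin the outcome into quartiles
breaks <- quantile(toy$price, probs = seq(0, 1, by = 0.25))
toy$bin <- cut(toy$price, breaks = breaks, include.lowest = TRUE)

# Within each bin, draw 75% of the row indices for the training set
train_rows <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$bin),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Each quartile bin should make up a similar share of both partitions
prop.table(table(toy_train$bin))
prop.table(table(toy_test$bin))
```

Sampling within bins keeps the outcome distribution similar in the training and testing sets, which matters most when the outcome is skewed.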

Your turn

Let’s take some time and investigate the training data. The outcome is avg_price_per_room.

Are there any interesting characteristics of the data?

(10 minutes)

Our versions

R version 4.4.1 (2024-06-14), Quarto (1.6.1)

package        version
bonsai         0.3.1
broom          1.0.6
Cubist         0.4.4
dials          1.3.0
doParallel     1.0.17
dplyr          1.1.4
earth          5.3.3
embed          1.1.4
finetune       1.2.0
Formula        1.2-5
ggplot2        3.5.1
lattice        0.22-6
lightgbm       4.5.0
lme4           1.1-35.5
modeldata      1.4.0
parallelly     1.38.0
parsnip        1.2.1
plotmo         3.6.3
plotrix        3.8-4
plumber        1.2.2
probably       1.0.3
purrr          1.0.2
recipes        1.1.0
rsample        1.2.1
rules          1.0.2
scales         1.3.0
splines2       0.5.3
stacks         1.0.5
text2vec       0.6.4
textrecipes    1.0.6
tibble         3.2.1
tidymodels     1.2.0
tidyr          1.3.1
tune           1.2.1
vetiver        0.2.5
workflows      1.1.4
workflowsets   1.1.0
yardstick      1.3.1