1 - Introduction

Advanced tidymodels

Welcome!

Wi-Fi network name

Posit Conf 2023

Wi-Fi password

conf2023

Workshop policies

Please do not photograph people wearing red lanyards
Gender-neutral bathrooms are located among the Grand Suite Bathrooms
There are two meditation/prayer rooms: Grand Suite 2A and 2B
A lactation room is located in Grand Suite 1
The meditation/prayer and lactation rooms are open Sun - Tue 7:30am - 7:00pm and Wed 8:00am - 6:00pm

Workshop policies

Please review the code of conduct and COVID policies, which apply to all workshops: https://posit.co/code-of-conduct/.
The CoC site explains how to report a problem (in person, by email, or by phone)

Who are you?

You can use the magrittr %>% or base R |> pipe
You are familiar with functions from dplyr, tidyr, ggplot2
You have exposure to basic statistical concepts
You do not need intermediate or expert familiarity with modeling or ML
You have used some tidymodels packages
You have some experience with evaluating statistical models using resampling techniques

Who are tidymodels?

Simon Couch
Hannah Frick
Emil Hvitfeldt
Max Kuhn

Ijeamaka Anyene (Day 1) and Edgar Ruiz (Day 2) are TAing!

Many thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and Desirée De Leon for their role in creating these materials!

Asking for help

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”

👀

Tentative plan for this workshop

Feature engineering with recipes
Model optimization by tuning
- Grid search
- Racing
- Iterative methods
Extras (time permitting)
- Effect encodings
- A case study

Introduce yourself to your neighbors 👋

Log in to Posit Cloud (free):

Check the workshop channel on Discord for the link!

Let’s install some packages

If you are using your own laptop instead of Posit Cloud:

# Install the packages for the workshop
pkgs <- 
  c("bonsai", "doParallel", "embed", "finetune", "lightgbm", "lme4",
    "plumber", "probably", "ranger", "rpart", "rpart.plot", "rules",
    "splines2", "stacks", "text2vec", "textrecipes", "tidymodels", 
    "vetiver", "remotes")

install.packages(pkgs)
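
If some of these packages are already on your laptop, you may prefer to install only the ones that are missing. A minimal sketch, assuming the same package list as above; `missing_pkgs` is a name introduced here for illustration:

```r
# Same package list as in the workshop instructions
pkgs <- 
  c("bonsai", "doParallel", "embed", "finetune", "lightgbm", "lme4",
    "plumber", "probably", "ranger", "rpart", "rpart.plot", "rules",
    "splines2", "stacks", "text2vec", "textrecipes", "tidymodels", 
    "vetiver", "remotes")

# Packages from `pkgs` that are not found in the current library paths
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result means everything is already installed. Note that this does not check versions, only presence.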

Or log in to Posit Cloud

Link in our Discord channel!

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package. We’ll sample down the data and rebuild a few factor columns so that their levels match the sampled rows:

library(tidymodels)

# Max's usual settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )
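
The `factor(as.character(...))` calls are likely there because, assuming these columns are stored as factors, sampling rows can leave levels that no longer occur in the data. A small sketch on a toy factor (the object names are hypothetical) shows the effect:

```r
# Toy factor with three levels
x <- factor(c("a", "b", "c", "a"))

# Subsetting keeps all original levels, even ones that no longer occur
x_sub <- x[x != "c"]
levels(x_sub)

# Converting to character and back rebuilds the levels from the remaining values
x_new <- factor(as.character(x_sub))
levels(x_new)

# droplevels() is a base R alternative that gives the same levels here
identical(levels(droplevels(x_sub)), levels(x_new))
```

One difference worth knowing: rebuilding from character sorts the levels alphabetically, while `droplevels()` keeps the original level order.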

Hotel data columns

names(hotel_rates)
#>  [1] "avg_price_per_room"             "lead_time"                     
#>  [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
#>  [5] "adults"                         "children"                      
#>  [7] "babies"                         "meal"                          
#>  [9] "country"                        "market_segment"                
#> [11] "distribution_channel"           "is_repeated_guest"             
#> [13] "previous_cancellations"         "previous_bookings_not_canceled"
#> [15] "reserved_room_type"             "assigned_room_type"            
#> [17] "booking_changes"                "agent"                         
#> [19] "company"                        "days_in_waiting_list"          
#> [21] "customer_type"                  "required_car_parking_spaces"   
#> [23] "total_of_special_requests"      "arrival_date_num"              
#> [25] "near_christmas"                 "near_new_years"                
#> [27] "historical_adr"

Data splitting strategy

Data Spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
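
The 75%/25% split comes from the default `prop = 3/4` in `initial_split()`. With a numeric `strata` column, rsample bins the outcome (into quartiles by default) and samples within each bin. A quick sanity check can be sketched on simulated data; the `toy` data frame and its columns are hypothetical stand-ins, not the hotel data:

```r
library(rsample)

# Hypothetical toy data with a right-skewed numeric outcome
set.seed(1)
toy <- data.frame(
  price = rlnorm(1000, meanlog = 4.5, sdlog = 0.5),
  x = runif(1000)
)

toy_split <- initial_split(toy, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of rows in the training set (roughly 0.75)
nrow(toy_train) / nrow(toy)

# With stratification, the outcome quantiles should look similar in both sets
quantile(toy_train$price)
quantile(toy_test$price)
```

The same two checks can be run on `hotel_train` and `hotel_test` with `avg_price_per_room` as the outcome.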

Your turn

Let’s take some time and investigate the training data. The outcome is avg_price_per_room.

Are there any interesting characteristics of the data?


Our versions

R version 4.2.2 (2022-10-31), Quarto (1.4.104)

package	version
bonsai	0.2.1
broom	1.0.5
dials	1.2.0
doParallel	1.0.17
dplyr	1.1.3
embed	1.1.2
finetune	1.1.0
ggplot2	3.4.3
lightgbm	3.3.5
lme4	1.1-34
modeldata	1.2.0
parsnip	1.1.1
plumber	1.2.1
probably	1.0.2
purrr	1.0.2
ranger	0.15.1
recipes	1.0.8
remotes	2.4.2.1
rpart	4.1.19
rpart.plot	3.1.1
rsample	1.2.0
rules	1.0.2
scales	1.2.1
splines2	0.5.1
stacks	1.0.2
text2vec	0.6.3
textrecipes	1.0.4
tibble	3.2.1
tidymodels	1.1.1
tidyr	1.3.0
tune	1.1.2
vetiver	0.2.4
workflows	1.1.3
workflowsets	1.0.1
yardstick	1.2.0
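
To compare your own setup against these versions, something like the following sketch may help. It only reports packages that are installed; the short `pkgs` vector and the name `local_versions` are chosen here for illustration:

```r
# A few of the workshop packages; extend this vector as needed
pkgs <- c("tidymodels", "recipes", "tune", "finetune")

# Keep only the packages that are installed locally
is_installed <- pkgs %in% rownames(installed.packages())

# One row per installed package with its version as a string
local_versions <- data.frame(
  package = pkgs[is_installed],
  version = vapply(
    pkgs[is_installed],
    function(p) as.character(packageVersion(p)),
    character(1)
  ),
  row.names = NULL
)

R.version.string
local_versions
```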