5 - Feature engineering: splines, target encoding and dates

Getting More Out of Feature Engineering and Tuning for Machine Learning

Getting set up

library(tidymodels)
library(embed)
library(extrasteps)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

Hotel data

Hotel rates data set

Regression data set for predicting the average daily rate for a room at the “Resort Hotel”. The agent and company columns use randomly generated names.

glimpse(hotel_rates)
#> Rows: 15,402
#> Columns: 28
#> $ avg_price_per_room             <dbl> 110.00, 74.00, 81.90, 81.00, 112.20, 90…
#> $ lead_time                      <dbl> 241, 273, 248, 236, 243, 267, 94, 10, 1…
#> $ stays_in_weekend_nights        <dbl> 0, 2, 2, 2, 4, 2, 4, 0, 0, 0, 0, 0, 0, …
#> $ stays_in_week_nights           <dbl> 1, 5, 5, 5, 10, 5, 7, 1, 1, 1, 1, 1, 1,…
#> $ adults                         <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, …
#> $ children                       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, …
#> $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ meal                           <fct> bed_and_breakfast, bed_and_breakfast, b…
#> $ country                        <fct> prt, aus, gbr, prt, gbr, null, prt, esp…
#> $ market_segment                 <fct> online_travel_agent, offline_travel_age…
#> $ distribution_channel           <fct> ta_to, ta_to, ta_to, ta_to, ta_to, ta_t…
#> $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ reserved_room_type             <fct> a, a, a, a, a, a, f, e, h, a, a, g, a, …
#> $ assigned_room_type             <fct> c, a, c, a, a, a, f, f, h, e, e, g, e, …
#> $ booking_changes                <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ agent                          <fct> devin_rivera_borrego, lia_nauth, jawhar…
#> $ company                        <fct> not_applicable, not_applicable, not_app…
#> $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ customer_type                  <fct> transient, transient_party, transient, …
#> $ required_car_parking_spaces    <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
#> $ total_of_special_requests      <dbl> 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, …
#> $ arrival_date                   <date> 2016-07-02, 2016-07-02, 2016-07-02, 20…
#> $ arrival_date_num               <dbl> 2016.5, 2016.5, 2016.5, 2016.5, 2016.5,…
#> $ near_christmas                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ near_new_years                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ historical_adr                 <dbl> 104.9811, 104.9811, 104.9811, 104.9811,…

Hotel data splitting

You should basically always split your data. We are doing it explicitly here because some artifacts of the split become useful later on.

set.seed(1234)
hotel_split <- initial_split(hotel_rates)
hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Your turn

Load and explore the hotel_train data

Comes loaded with the modeldata package

05:00

Nonlinear predictors

Splines

Splines

Splines transform a single numeric predictor into multiple numeric predictors, with the hope that the new predictors have a more linear relationship with the outcome.

They are mostly needed for linear models, but they should rarely hurt with other model types.

If you’ve ever used geom_smooth() you have seen splines in action.
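For instance, a quick sketch with the hotel data (the variable pairing is just for illustration): with a data set this size, geom_smooth() fits a spline-based smoother behind the scenes.

ggplot(hotel_train, aes(lead_time, avg_price_per_room)) +
  geom_point(alpha = 0.1) +
  # with this many rows, geom_smooth() defaults to a spline-based GAM smoother
  geom_smooth()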

Splines explained

A spline is a piecewise polynomial function.

We have two main parameters to worry about: the number of knots and the polynomial degree.

The domain of the predictor is split into k regions, with a knot between neighboring regions, and a polynomial is fit within each region under the constraint that adjacent pieces meet at the knots.

knots: 1, degree: 1

knots: 2, degree: 1

knots: 5, degree: 1

knots: 5, degree: 2

knots: 5, degree: 3

knots: 9, degree: 3

B-Spline features visualized - degree: 3

Splines as numbers

arrival_date_num Spline Feature 1 Spline Feature 2 Spline Feature 3 Spline Feature 4 Spline Feature 5 Spline Feature 6
2017.619 0.00 0.00 0.00 0.03 0.35 0.62
2016.844 0.15 0.59 0.26 0.00 0.00 0.00
2016.702 0.51 0.40 0.05 0.00 0.00 0.00
2017.077 0.00 0.19 0.67 0.14 0.00 0.00
2016.861 0.13 0.58 0.29 0.00 0.00 0.00
2017.019 0.00 0.30 0.62 0.07 0.00 0.00
2017.123 0.00 0.11 0.66 0.23 0.00 0.00
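
Features like the ones above can be produced with a recipe. A minimal sketch, assuming six cubic B-spline features are wanted (the deg_free and degree values are illustrative, not necessarily the settings used for the table):

recipe(avg_price_per_room ~ arrival_date_num, data = hotel_train) |>
  # expand arrival_date_num into 6 cubic B-spline basis columns
  step_spline_b(arrival_date_num, deg_free = 6, degree = 3) |>
  prep() |>
  bake(new_data = NULL)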

Splines pros and cons

Pros

  • fast
  • easy to use
  • semi-interpretable

Cons

  • adds more columns
  • will need to select the number of columns
  • can be messy outside the range

Your turn

Apply B-splines to some variables using step_spline_b()

03:00

Factors with many categories

hotel_train |>
  count(country)
#> # A tibble: 88 × 2
#>    country     n
#>    <fct>   <int>
#>  1 ago         7
#>  2 and         1
#>  3 are         3
#>  4 arg        14
#>  5 aus        31
#>  6 aut        76
#>  7 aze         2
#>  8 bel       184
#>  9 bgr         2
#> 10 bhs         1
#> # ℹ 78 more rows
hotel_train |>
  count(company)
#> # A tibble: 157 × 2
#>    company                 n
#>    <fct>               <int>
#>  1 abdou_llc               2
#>  2 afework_llc             5
#>  3 alston_pbc              3
#>  4 battle_llc             23
#>  5 bennett_and_company     3
#>  6 berhanu_pbc            53
#>  7 biggers_llc             4
#>  8 blasingime_llc          1
#>  9 boddy_llc              16
#> 10 boles_pbc               3
#> # ℹ 147 more rows
hotel_train |>
  count(agent)
#> # A tibble: 119 × 2
#>    agent                 n
#>    <fct>             <int>
#>  1 aaron_marquez         2
#>  2 alexander_drake    1117
#>  3 allen_her             1
#>  4 anas_el_bashir        1
#>  5 araseli_billy         1
#>  6 arhab_al_islam        7
#>  7 audray_tucker        38
#>  8 bernice_baltierra    35
#>  9 betzy_rodriguez      66
#> 10 brayan_guerrero       2
#> # ℹ 109 more rows

How do we handle them?

We could:

  • Make the full set of indicator variables 😳

  • Lump agents and companies that rarely occur into an “other” group (sketched after this list)

  • Use feature hashing to create a smaller set of indicator variables

  • Use target encoding to replace the country, agent, and company columns with the estimated effect of each predictor
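
As a concrete sketch of the lumping option (the 1% threshold is an illustrative choice, not a recommendation):

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  # collapse levels that appear in less than 1% of the rows into an "other" level
  step_other(agent, company, threshold = 0.01) |>
  # then make indicator variables for the remaining levels
  step_dummy(all_nominal_predictors())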

Target encoding

Target encoding

Target encoding (also called mean encoding, likelihood encoding, impact encoding, or effect encoding) is a supervised method that turns a single categorical predictor into a single numeric predictor.

It is most often used for categorical predictors with many levels, although it works regardless of the number of levels.

Since the encoding is trained on the outcome, make sure to use cross-validation to avoid overfitting.

Target encoding motivation

You have a numeric outcome and a categorical predictor, and you want to transform each level of the categorical predictor into a value that best represents the outcome.

We calculate the mean of the outcome within each level of the predictor, and use that as the new value.

Caution

Don’t use just this! We are building up the method one piece at a time. Unregularized target encoding is very prone to overfitting.

hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    .by = agent
  )
#> # A tibble: 119 × 2
#>    agent                 mean
#>    <fct>                <dbl>
#>  1 alexander_drake      144. 
#>  2 kaylae_maxedon        62.5
#>  3 michael_mcdole        60.9
#>  4 devin_rivera_borrego 126. 
#>  5 james_richards        78.6
#>  6 estela_bonilla        41.9
#>  7 charles_najera       109. 
#>  8 reema_el_tamer       118. 
#>  9 jawhara_al_azad       90.1
#> 10 not_applicable        84.1
#> # ℹ 109 more rows

Target encoding handling unseen levels

Caution

Don’t look at the testing data set like this in practice. We do it here for educational purposes only.

hotel_train |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#>    agent                   n
#>    <fct>               <int>
#>  1 aaron_marquez           2
#>  2 aayaat_al_farran        0
#>  3 alanah_cook             0
#>  4 alexander_drake      1117
#>  5 allen_her               1
#>  6 amirah_christian        0
#>  7 anas_el_bashir          1
#>  8 anna_beltran_moreno     0
#>  9 anna_choi               0
#> 10 araseli_billy           1
#> # ℹ 164 more rows
hotel_test |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#>    agent                   n
#>    <fct>               <int>
#>  1 aaron_marquez           1
#>  2 aayaat_al_farran        0
#>  3 alanah_cook             0
#>  4 alexander_drake       367
#>  5 allen_her               1
#>  6 amirah_christian        0
#>  7 anas_el_bashir          0
#>  8 anna_beltran_moreno     0
#>  9 anna_choi               0
#> 10 araseli_billy           1
#> # ℹ 164 more rows

Target encoding handling unseen levels

Calculate the global mean of the outcome and use it for cases that aren’t seen in the training data set.


mean(hotel_train$avg_price_per_room)
#> [1] 104.6039
hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    .by = agent
  )
#> # A tibble: 119 × 2
#>    agent                 mean
#>    <fct>                <dbl>
#>  1 alexander_drake      144. 
#>  2 kaylae_maxedon        62.5
#>  3 michael_mcdole        60.9
#>  4 devin_rivera_borrego 126. 
#>  5 james_richards        78.6
#>  6 estela_bonilla        41.9
#>  7 charles_najera       109. 
#>  8 reema_el_tamer       118. 
#>  9 jawhara_al_azad       90.1
#> 10 not_applicable        84.1
#> # ℹ 109 more rows

How do we handle low counts?

Some of the levels have very low counts. We can’t have the same confidence in those means as in the means calculated from many observations.

We already use the global mean for levels with zero occurrences. Let’s also shrink each level’s mean toward the global mean, with the amount of adjustment depending on the counts.

hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    n = n(),
    .by = agent
  ) |>
  arrange(agent)
#> # A tibble: 119 × 3
#>    agent              mean     n
#>    <fct>             <dbl> <int>
#>  1 aaron_marquez     118.      2
#>  2 alexander_drake   144.   1117
#>  3 allen_her          65       1
#>  4 anas_el_bashir     99       1
#>  5 araseli_billy      40       1
#>  6 arhab_al_islam     35       7
#>  7 audray_tucker      76.0    38
#>  8 bernice_baltierra  71.3    35
#>  9 betzy_rodriguez    84.0    66
#> 10 brayan_guerrero    37.5     2
#> # ℹ 109 more rows
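
One way to picture this shrinkage is as a weighted blend of each agent’s mean and the global mean. A minimal sketch, where the smoothing constant m is an arbitrary value for illustration (not what the recipe step uses):

global_mean <- mean(hotel_train$avg_price_per_room)
m <- 20  # illustrative smoothing constant: acts like m extra "global" observations

hotel_train |>
  summarise(mean = mean(avg_price_per_room), n = n(), .by = agent) |>
  # levels with few observations are pulled toward the global mean,
  # levels with many observations keep roughly their own mean
  mutate(shrunk_mean = (n * mean + m * global_mean) / (n + m)) |>
  arrange(n)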

Partial pooling

Partial pooling somewhat lowers the risk of overfitting since it tends to correct for agents with small sample sizes. It can’t correct for improper data usage or data leakage, though.

Partial pooling results

Implementations

We have described this method solely based on analytical calculations (step_lencode()), but you could arrive at similar numbers using a model-based approach by fitting a no-intercept generalized linear model. A hierarchical version would induce partial pooling.
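
A sketch of that model-based view, assuming the lme4 package is available; it illustrates the idea rather than reproducing the exact numbers from the recipe steps:

library(lme4)

# drop factor levels that do not occur in the training data
hotel_train_obs <- droplevels(hotel_train)

# no pooling: one coefficient per agent, identical to the per-agent means
no_pooling <- lm(avg_price_per_room ~ 0 + agent, data = hotel_train_obs)

# partial pooling: a random intercept per agent, shrunk toward the overall mean
partial_pooling <- lmer(avg_price_per_room ~ 1 + (1 | agent), data = hotel_train_obs)

coef(no_pooling)
coef(partial_pooling)$agent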

Target encoding in recipes

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_lencode(
    agent, country, company,
    outcome = vars(avg_price_per_room), smooth = TRUE
  ) |>
  prep() |>
  bake(new_data = NULL) |>
  select(agent, country, company)
#> # A tibble: 11,551 × 3
#>    agent country company
#>    <dbl>   <dbl>   <dbl>
#>  1 144.    108.     109.
#>  2  77.2    83.4    109.
#>  3  61.5    99.9    109.
#>  4 126.    108.     109.
#>  5 144.    108.     109.
#>  6  79.9    73.8    109.
#>  7 126.     99.9    109.
#>  8  69.8   108.     109.
#>  9 126.     99.9    109.
#> 10 144.    108.     109.
#> # ℹ 11,541 more rows

Your turn

Apply target encoding to the data set and see how it affects different predictors, not just the ones we listed here.

  • step_lencode()
  • step_lencode_glm()
  • step_lencode_bayes()
  • step_lencode_mixed()
03:00

Date time variables

Date time variables

How can we represent the date column arrival_date for our model?

When we use a date column in its native format, most models in R convert it to an integer.
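
For example, a Date is stored as the number of days since 1970-01-01:

as.numeric(as.Date("2016-07-02"))
#> [1] 16984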

We can re-engineer it as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays

Your turn

Explore the arrival_date variable and its relation to avg_price_per_room

The lubridate package might prove helpful

05:00

arrival_date date features

Using step_date(arrival_date, features = c("year", "month", "dow", "decimal", "mday", "doy", "week", "semester", "quarter"))

arrival_date year month dow decimal mday doy week semester quarter
2016-08-30 2016 Aug Tue 2016.661 30 243 35 2 3
2016-10-22 2016 Oct Sat 2016.806 22 296 43 2 4
2016-12-17 2016 Dec Sat 2016.959 17 352 51 2 4
2017-02-13 2017 Feb Mon 2017.118 13 44 7 1 1
2017-04-05 2017 Apr Wed 2017.258 5 95 14 1 2

arrival_date date features

Adding label = FALSE

arrival_date year month dow decimal mday doy week semester quarter
2016-08-30 2016 8 3 2016.661 30 243 35 2 3
2016-10-22 2016 10 7 2016.806 22 296 43 2 4
2016-12-17 2016 12 7 2016.959 17 352 51 2 4
2017-02-13 2017 2 2 2017.118 13 44 7 1 1
2017-04-05 2017 4 4 2017.258 5 95 14 1 2

Other recipes steps

step_time() works the same as step_date() but for units smaller than a day: hour, hour12, am/pm, minute, second, decimal_day.

step_holiday() adds indicator variables for holidays. See timeDate::listHolidays() for the supported holidays.
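
A minimal sketch combining these steps (the specific features and holidays chosen here are illustrative):

recipe(avg_price_per_room ~ arrival_date, data = hotel_train) |>
  step_date(arrival_date, features = c("year", "month", "dow")) |>
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) |>
  prep() |>
  bake(new_data = NULL)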

Your turn

Apply date-based steps to the arrival_date variable and see whether they capture anything about avg_price_per_room

03:00

dates as numerics

holidays as numerics

What are the issues with these features?

The numeric features make it easy to capture the beginning or the end, but harder to capture anything more granular.

The indicator features mostly care about the day itself. They carry no information about the lead-up or the aftermath.

time events

Using extrasteps::step_time_event() and the almanac package, we can create useful time features.

library(almanac)

# every weekday (Monday through Friday), except Christmas Day
rule_1 <- weekly() |>
  recur_on_weekdays() |>
  rsetdiff(hol_christmas())

# the 1st day of every third month, anchored at 2000-01-01
rule_2 <- monthly(since = "2000-01-01") |>
  recur_on_interval(3) |>
  recur_on_day_of_month(1)

# every Thursday in June, July, and August, starting 1997-06-05
rule_3 <- yearly("1997-06-05") |>
  recur_on_day_of_week("Thursday") |>
  recur_on_month_of_year(c("Jun", "July", "Aug"))

step_time_event()

Create a list of rules (last slide) and pass them to the rules argument of extrasteps::step_time_event()

rules <- list(rule_1 = rule_1, rule_2 = rule_2, rule_3 = rule_3)

recipe(~arrival_date, data = hotel_rates) |>
  step_time_event(arrival_date, rules = rules)

step_time_event() as numerics

Non-indicator time events

These features still have the issue that they only attach value to the date itself.

We can attach values based on how far away we are from those dates.
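
The idea can be sketched with plain almanac functions, using rule_2 from earlier (this is just an illustration of the distance features, not how the extrasteps steps are implemented):

dates <- hotel_train$arrival_date

# days until the next event and days since the previous event defined by rule_2
days_until <- as.numeric(alma_next(dates, rule_2, inclusive = TRUE) - dates)
days_since <- as.numeric(dates - alma_previous(dates, rule_2, inclusive = TRUE))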

step_date_before()

step_date_before() - inverse

step_date_after() - inverse

step_date_nearest() - inverse

Datetime features

Avoid crafting datetime features by hand if at all possible.

Dealing with uneven month lengths, leap days (and even leap seconds) is painful.

So is defining any event that doesn’t land on the same day of the week or the same date each year.

For example: the first Sunday after the first full moon on or after the vernal equinox (Easter).