5 - Feature engineering: splines, target encoding and dates

Getting More Out of Feature Engineering and Tuning for Machine Learning

Getting set up

library(tidymodels)
library(embed)
library(extrasteps)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

Hotel data

Hotel rates data set

Regression data set for predicting the average daily rate for a room at the “Resort Hotel”. The agent and company columns use randomly generated names.

glimpse(hotel_rates)
#> Rows: 15,402
#> Columns: 28
#> $ avg_price_per_room             <dbl> 110.00, 74.00, 81.90, 81.00, 112.20, 90…
#> $ lead_time                      <dbl> 241, 273, 248, 236, 243, 267, 94, 10, 1…
#> $ stays_in_weekend_nights        <dbl> 0, 2, 2, 2, 4, 2, 4, 0, 0, 0, 0, 0, 0, …
#> $ stays_in_week_nights           <dbl> 1, 5, 5, 5, 10, 5, 7, 1, 1, 1, 1, 1, 1,…
#> $ adults                         <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, …
#> $ children                       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, …
#> $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ meal                           <fct> bed_and_breakfast, bed_and_breakfast, b…
#> $ country                        <fct> prt, aus, gbr, prt, gbr, null, prt, esp…
#> $ market_segment                 <fct> online_travel_agent, offline_travel_age…
#> $ distribution_channel           <fct> ta_to, ta_to, ta_to, ta_to, ta_to, ta_t…
#> $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ reserved_room_type             <fct> a, a, a, a, a, a, f, e, h, a, a, g, a, …
#> $ assigned_room_type             <fct> c, a, c, a, a, a, f, f, h, e, e, g, e, …
#> $ booking_changes                <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ agent                          <fct> devin_rivera_borrego, lia_nauth, jawhar…
#> $ company                        <fct> not_applicable, not_applicable, not_app…
#> $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ customer_type                  <fct> transient, transient_party, transient, …
#> $ required_car_parking_spaces    <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
#> $ total_of_special_requests      <dbl> 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, …
#> $ arrival_date                   <date> 2016-07-02, 2016-07-02, 2016-07-02, 20…
#> $ arrival_date_num               <dbl> 2016.5, 2016.5, 2016.5, 2016.5, 2016.5,…
#> $ near_christmas                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ near_new_years                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ historical_adr                 <dbl> 104.9811, 104.9811, 104.9811, 104.9811,…

Hotel data splitting

You should basically always split your data. We are doing it explicitly here because some artifacts of the split become useful later on.

set.seed(1234)
hotel_split <- initial_split(hotel_rates)
hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Your turn

Load and explore the hotel_train data

Comes loaded with the modeldata package

05:00

Nonlinear predictors

Splines

Splines

Splines transform a single numeric predictor into multiple numeric predictors, with the hope that the new predictors have a more linear relationship with the outcome.

They are mostly needed for linear models, but they should rarely hurt with other model types.

If you’ve ever used geom_smooth() you have seen splines in action.
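For instance, a quick sketch with the hotel data (the variable pairing is just for illustration): with a data set this size, geom_smooth() fits a spline-based smoother behind the scenes.

ggplot(hotel_train, aes(lead_time, avg_price_per_room)) +
  geom_point(alpha = 0.1) +
  # with this many rows, geom_smooth() defaults to a spline-based GAM smoother
  geom_smooth()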

Splines explained

A spline is a piecewise polynomial function.

We have two main parameters to worry about: the number of knots and the polynomial degree.

The domain of the predictor is split into k regions, with a knot between neighboring regions, and a polynomial is fit within each region under the constraint that adjacent pieces meet at the knots.

knots: 1, degree: 1

knots: 2, degree: 1

knots: 5, degree: 1

knots: 5, degree: 2

knots: 5, degree: 3

knots: 9, degree: 3

B-Spline features visualized - degree: 3

Splines as numbers

arrival_date_num Spline Feature 1 Spline Feature 2 Spline Feature 3 Spline Feature 4 Spline Feature 5 Spline Feature 6
2017.619 0.00 0.00 0.00 0.03 0.35 0.62
2016.844 0.15 0.59 0.26 0.00 0.00 0.00
2016.702 0.51 0.40 0.05 0.00 0.00 0.00
2017.077 0.00 0.19 0.67 0.14 0.00 0.00
2016.861 0.13 0.58 0.29 0.00 0.00 0.00
2017.019 0.00 0.30 0.62 0.07 0.00 0.00
2017.123 0.00 0.11 0.66 0.23 0.00 0.00
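
Features like the ones above can be produced with a recipe. A minimal sketch, assuming six cubic B-spline features are wanted (the deg_free and degree values are illustrative, not necessarily the settings used for the table):

recipe(avg_price_per_room ~ arrival_date_num, data = hotel_train) |>
  # expand arrival_date_num into 6 cubic B-spline basis columns
  step_spline_b(arrival_date_num, deg_free = 6, degree = 3) |>
  prep() |>
  bake(new_data = NULL)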

Splines pros and cons

Pros

  • fast
  • easy to use
  • semi-interpretable

Cons

  • adds more columns
  • will need to select the number of columns
  • can be messy outside the range

Your turn

Apply B-splines to some variables using step_spline_b()

03:00

Factors with many categories

hotel_train |>
  count(country)
#> # A tibble: 88 × 2
#>    country     n
#>    <fct>   <int>
#>  1 ago         7
#>  2 and         1
#>  3 are         3
#>  4 arg        14
#>  5 aus        31
#>  6 aut        76
#>  7 aze         2
#>  8 bel       184
#>  9 bgr         2
#> 10 bhs         1
#> # ℹ 78 more rows
hotel_train |>
  count(company)
#> # A tibble: 157 × 2
#>    company                 n
#>    <fct>               <int>
#>  1 abdou_llc               2
#>  2 afework_llc             5
#>  3 alston_pbc              3
#>  4 battle_llc             23
#>  5 bennett_and_company     3
#>  6 berhanu_pbc            53
#>  7 biggers_llc             4
#>  8 blasingime_llc          1
#>  9 boddy_llc              16
#> 10 boles_pbc               3
#> # ℹ 147 more rows
hotel_train |>
  count(agent)
#> # A tibble: 119 × 2
#>    agent                 n
#>    <fct>             <int>
#>  1 aaron_marquez         2
#>  2 alexander_drake    1117
#>  3 allen_her             1
#>  4 anas_el_bashir        1
#>  5 araseli_billy         1
#>  6 arhab_al_islam        7
#>  7 audray_tucker        38
#>  8 bernice_baltierra    35
#>  9 betzy_rodriguez      66
#> 10 brayan_guerrero       2
#> # ℹ 109 more rows

How do we handle them?

We could:

  • Make the full set of indicator variables 😳

  • Lump agents and companies that rarely occur into an “other” group (sketched after this list)

  • Use feature hashing to create a smaller set of indicator variables

  • Use target encoding to replace the country, agent, and company columns with the estimated effect of each predictor
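
As a concrete sketch of the lumping option (the 1% threshold is an illustrative choice, not a recommendation):

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  # collapse levels that appear in less than 1% of the rows into an "other" level
  step_other(agent, company, threshold = 0.01) |>
  # then make indicator variables for the remaining levels
  step_dummy(all_nominal_predictors())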

Target encoding

Target encoding

Target encoding (also called mean encoding, likelihood encoding, impact encoding, or effect encoding) is a supervised method that turns a single categorical predictor into a single numeric predictor.

It is most often used for categorical predictors with many levels, although it works regardless of the number of levels.

Since the encoding is trained on the outcome, make sure to use cross-validation to avoid overfitting.

Target encoding motivation

You have a numeric outcome and a categorical predictor, and you want to transform each level of the categorical predictor into a value that best represents the outcome.

We calculate the mean of the outcome within each level of the predictor, and use that as the new value.

Caution

Don’t use just this! We are building up the method one piece at a time. Unregularized target encoding is very prone to overfitting.

hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    .by = agent
  )
#> # A tibble: 119 × 2
#>    agent                 mean
#>    <fct>                <dbl>
#>  1 alexander_drake      144. 
#>  2 kaylae_maxedon        62.5
#>  3 michael_mcdole        60.9
#>  4 devin_rivera_borrego 126. 
#>  5 james_richards        78.6
#>  6 estela_bonilla        41.9
#>  7 charles_najera       109. 
#>  8 reema_el_tamer       118. 
#>  9 jawhara_al_azad       90.1
#> 10 not_applicable        84.1
#> # ℹ 109 more rows

Target encoding handling unseen levels

Caution

Don’t look at the testing data set like this in practice. We do it here for educational purposes only.

hotel_train |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#>    agent                   n
#>    <fct>               <int>
#>  1 aaron_marquez           2
#>  2 aayaat_al_farran        0
#>  3 alanah_cook             0
#>  4 alexander_drake      1117
#>  5 allen_her               1
#>  6 amirah_christian        0
#>  7 anas_el_bashir          1
#>  8 anna_beltran_moreno     0
#>  9 anna_choi               0
#> 10 araseli_billy           1
#> # ℹ 164 more rows
hotel_test |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#>    agent                   n
#>    <fct>               <int>
#>  1 aaron_marquez           1
#>  2 aayaat_al_farran        0
#>  3 alanah_cook             0
#>  4 alexander_drake       367
#>  5 allen_her               1
#>  6 amirah_christian        0
#>  7 anas_el_bashir          0
#>  8 anna_beltran_moreno     0
#>  9 anna_choi               0
#> 10 araseli_billy           1
#> # ℹ 164 more rows

Target encoding handling unseen levels

Calculate the global mean of the outcome and use it for cases that aren’t seen in the training data set.


mean(hotel_train$avg_price_per_room)
#> [1] 104.6039
hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    .by = agent
  )
#> # A tibble: 119 × 2
#>    agent                 mean
#>    <fct>                <dbl>
#>  1 alexander_drake      144. 
#>  2 kaylae_maxedon        62.5
#>  3 michael_mcdole        60.9
#>  4 devin_rivera_borrego 126. 
#>  5 james_richards        78.6
#>  6 estela_bonilla        41.9
#>  7 charles_najera       109. 
#>  8 reema_el_tamer       118. 
#>  9 jawhara_al_azad       90.1
#> 10 not_applicable        84.1
#> # ℹ 109 more rows

How do we handle low counts?

Some of the levels have very low counts. We can’t have the same confidence in those means as in the means calculated from many observations.

We already use the global mean for levels with zero occurrences. Let’s also shrink each level’s mean toward the global mean, with the amount of adjustment depending on the counts.

hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    n = n(),
    .by = agent
  ) |>
  arrange(agent)
#> # A tibble: 119 × 3
#>    agent              mean     n
#>    <fct>             <dbl> <int>
#>  1 aaron_marquez     118.      2
#>  2 alexander_drake   144.   1117
#>  3 allen_her          65       1
#>  4 anas_el_bashir     99       1
#>  5 araseli_billy      40       1
#>  6 arhab_al_islam     35       7
#>  7 audray_tucker      76.0    38
#>  8 bernice_baltierra  71.3    35
#>  9 betzy_rodriguez    84.0    66
#> 10 brayan_guerrero    37.5     2
#> # ℹ 109 more rows
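
One way to picture this shrinkage is as a weighted blend of each agent’s mean and the global mean. A minimal sketch, where the smoothing constant m is an arbitrary value for illustration (not what the recipe step uses):

global_mean <- mean(hotel_train$avg_price_per_room)
m <- 20  # illustrative smoothing constant: acts like m extra "global" observations

hotel_train |>
  summarise(mean = mean(avg_price_per_room), n = n(), .by = agent) |>
  # levels with few observations are pulled toward the global mean,
  # levels with many observations keep roughly their own mean
  mutate(shrunk_mean = (n * mean + m * global_mean) / (n + m)) |>
  arrange(n)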

Partial pooling

Partial pooling somewhat lowers the risk of overfitting since it tends to correct for agents with small sample sizes. It can’t correct for improper data usage or data leakage, though.

Partial pooling results

Implementations

We have described this method solely based on analytical calculations (step_lencode()), but you could arrive at similar numbers using a model-based approach by fitting a no-intercept generalized linear model. A hierarchical version would induce partial pooling.
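
A sketch of that model-based view, assuming the lme4 package is available; it illustrates the idea rather than reproducing the exact numbers from the recipe steps:

library(lme4)

# drop factor levels that do not occur in the training data
hotel_train_obs <- droplevels(hotel_train)

# no pooling: one coefficient per agent, identical to the per-agent means
no_pooling <- lm(avg_price_per_room ~ 0 + agent, data = hotel_train_obs)

# partial pooling: a random intercept per agent, shrunk toward the overall mean
partial_pooling <- lmer(avg_price_per_room ~ 1 + (1 | agent), data = hotel_train_obs)

coef(no_pooling)
coef(partial_pooling)$agent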

Target encoding in recipes

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_lencode(
    agent, country, company,
    outcome = vars(avg_price_per_room), smooth = TRUE
  ) |>
  prep() |>
  bake(new_data = NULL) |>
  select(agent, country, company)
#> # A tibble: 11,551 × 3
#>    agent country company
#>    <dbl>   <dbl>   <dbl>
#>  1 144.    108.     109.
#>  2  77.2    83.4    109.
#>  3  61.5    99.9    109.
#>  4 126.    108.     109.
#>  5 144.    108.     109.
#>  6  79.9    73.8    109.
#>  7 126.     99.9    109.
#>  8  69.8   108.     109.
#>  9 126.     99.9    109.
#> 10 144.    108.     109.
#> # ℹ 11,541 more rows

Your turn

Apply target encoding to the data set and see how it affects different predictors, not just the ones we listed here.

  • step_lencode()
  • step_lencode_glm()
  • step_lencode_bayes()
  • step_lencode_mixed()
03:00

Date time variables

Date time variables

How can we represent the date column arrival_date for our model?

When we use a date column in its native format, most models in R convert it to an integer.
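
For example, a Date is stored as the number of days since 1970-01-01:

as.numeric(as.Date("2016-07-02"))
#> [1] 16984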

We can re-engineer it as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays

Your turn

Explore the arrival_date variable and its relation to avg_price_per_room

The lubridate package might prove helpful

05:00

arrival_date date features

Using step_date(arrival_date, features = c("year", "month", "dow", "decimal", "mday", "doy", "week", "semester", "quarter"))

arrival_date year month dow decimal mday doy week semester quarter
2016-08-30 2016 Aug Tue 2016.661 30 243 35 2 3
2016-10-22 2016 Oct Sat 2016.806 22 296 43 2 4
2016-12-17 2016 Dec Sat 2016.959 17 352 51 2 4
2017-02-13 2017 Feb Mon 2017.118 13 44 7 1 1
2017-04-05 2017 Apr Wed 2017.258 5 95 14 1 2

arrival_date date features

Adding label = FALSE

arrival_date year month dow decimal mday doy week semester quarter
2016-08-30 2016 8 3 2016.661 30 243 35 2 3
2016-10-22 2016 10 7 2016.806 22 296 43 2 4
2016-12-17 2016 12 7 2016.959 17 352 51 2 4
2017-02-13 2017 2 2 2017.118 13 44 7 1 1
2017-04-05 2017 4 4 2017.258 5 95 14 1 2

Other recipes steps

step_time() works the same as step_date() but for units smaller than a day: hour, hour12, am/pm, minute, second, decimal_day.

step_holiday() adds indicator variables for holidays. See timeDate::listHolidays() for the supported holidays.
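
A minimal sketch combining these steps (the specific features and holidays chosen here are illustrative):

recipe(avg_price_per_room ~ arrival_date, data = hotel_train) |>
  step_date(arrival_date, features = c("year", "month", "dow")) |>
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) |>
  prep() |>
  bake(new_data = NULL)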

Your turn

Apply date-based steps to the arrival_date variable and see whether they capture anything about avg_price_per_room

03:00

dates as numerics

holidays as numerics

What are the issues with these features?

The numeric features make it easy to capture the beginning or the end, but harder to capture anything more granular.

The indicator features mostly care about the day itself. They carry no information about the lead-up or the aftermath.

time events

Using extrasteps::step_time_event() and the almanac package, we can create useful time features.

library(almanac)

# every weekday (Monday through Friday), except Christmas Day
rule_1 <- weekly() |>
  recur_on_weekdays() |>
  rsetdiff(hol_christmas())

# the 1st day of every third month, anchored at 2000-01-01
rule_2 <- monthly(since = "2000-01-01") |>
  recur_on_interval(3) |>
  recur_on_day_of_month(1)

# every Thursday in June, July, and August, starting 1997-06-05
rule_3 <- yearly("1997-06-05") |>
  recur_on_day_of_week("Thursday") |>
  recur_on_month_of_year(c("Jun", "July", "Aug"))

step_time_event()

Create a list of rules (last slide) and pass them to the rules argument of extrasteps::step_time_event()

rules <- list(rule_1 = rule_1, rule_2 = rule_2, rule_3 = rule_3)

recipe(~arrival_date, data = hotel_rates) |>
  step_time_event(arrival_date, rules = rules)

step_time_event() as numerics

Non-indicator time events

These features still have the issue that they only attach value to the date itself.

We can attach values based on how far away we are from those dates.
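
The idea can be sketched with plain almanac functions, using rule_2 from earlier (this is just an illustration of the distance features, not how the extrasteps steps are implemented):

dates <- hotel_train$arrival_date

# days until the next event and days since the previous event defined by rule_2
days_until <- as.numeric(alma_next(dates, rule_2, inclusive = TRUE) - dates)
days_since <- as.numeric(dates - alma_previous(dates, rule_2, inclusive = TRUE))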

step_date_before()

step_date_before() - inverse

step_date_after() - inverse

step_date_nearest() - inverse

Datetime features

Avoid crafting datetime features by hand if at all possible.

Dealing with uneven month lengths, leap days (and even leap seconds) is painful.

So is defining any event that doesn’t land on the same day of the week or the same date each year.

For example: the first Sunday after the first full moon on or after the vernal equinox (Easter).