Getting More Out of Feature Engineering and Tuning for Machine Learning
A regression data set for predicting the average daily rate for a room at the “Resort Hotel”. The agent and company columns use random names.
glimpse(hotel_rates)
#> Rows: 15,402
#> Columns: 28
#> $ avg_price_per_room <dbl> 110.00, 74.00, 81.90, 81.00, 112.20, 90…
#> $ lead_time <dbl> 241, 273, 248, 236, 243, 267, 94, 10, 1…
#> $ stays_in_weekend_nights <dbl> 0, 2, 2, 2, 4, 2, 4, 0, 0, 0, 0, 0, 0, …
#> $ stays_in_week_nights <dbl> 1, 5, 5, 5, 10, 5, 7, 1, 1, 1, 1, 1, 1,…
#> $ adults <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, …
#> $ children <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, …
#> $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ meal <fct> bed_and_breakfast, bed_and_breakfast, b…
#> $ country <fct> prt, aus, gbr, prt, gbr, null, prt, esp…
#> $ market_segment <fct> online_travel_agent, offline_travel_age…
#> $ distribution_channel <fct> ta_to, ta_to, ta_to, ta_to, ta_to, ta_t…
#> $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ reserved_room_type <fct> a, a, a, a, a, a, f, e, h, a, a, g, a, …
#> $ assigned_room_type <fct> c, a, c, a, a, a, f, f, h, e, e, g, e, …
#> $ booking_changes <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ agent <fct> devin_rivera_borrego, lia_nauth, jawhar…
#> $ company <fct> not_applicable, not_applicable, not_app…
#> $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ customer_type <fct> transient, transient_party, transient, …
#> $ required_car_parking_spaces <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
#> $ total_of_special_requests <dbl> 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, …
#> $ arrival_date <date> 2016-07-02, 2016-07-02, 2016-07-02, 20…
#> $ arrival_date_num <dbl> 2016.5, 2016.5, 2016.5, 2016.5, 2016.5,…
#> $ near_christmas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ near_new_years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ historical_adr <dbl> 104.9811, 104.9811, 104.9811, 104.9811,…
Antonio, N., de Almeida, A., and Nunes, L. (2019). Hotel booking demand datasets. Data in Brief, 22, 41-49.
You should generally always split your data. We do the split explicitly here because some of its artifacts become useful later on.
Load and explore the hotel_train data. The underlying hotel_rates data set comes with the modeldata package.
05:00
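A minimal sketch of one way to get to hotel_train and hotel_test, assuming a 75/25 initial_split(); the seed is made up and the workshop’s actual split may differ.

library(tidymodels) # attaches rsample, recipes, dplyr, modeldata, ...

set.seed(1234) # hypothetical seed
hotel_split <- initial_split(hotel_rates, prop = 0.75)
hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

glimpse(hotel_train)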
A spline expansion is a way to transform a single numeric predictor into multiple numeric predictors, with the hope that the new predictors are more linearly related to the outcome. It is mostly needed with linear models, but it should rarely hurt to use it.
If you’ve ever used geom_smooth(), you have seen splines in action.
A spline is a piecewise polynomial function.
There are two main parameters to worry about: the number of knots and the polynomial degree. The domain of the predictor is split into k regions, with a knot between each, and a polynomial is fit within each region, under the constraint that adjacent pieces join at the knots.
arrival_date_num | Spline Feature 1 | Spline Feature 2 | Spline Feature 3 | Spline Feature 4 | Spline Feature 5 | Spline Feature 6 |
---|---|---|---|---|---|---|
2017.619 | 0.00 | 0.00 | 0.00 | 0.03 | 0.35 | 0.62 |
2016.844 | 0.15 | 0.59 | 0.26 | 0.00 | 0.00 | 0.00 |
2016.702 | 0.51 | 0.40 | 0.05 | 0.00 | 0.00 | 0.00 |
2017.077 | 0.00 | 0.19 | 0.67 | 0.14 | 0.00 | 0.00 |
2016.861 | 0.13 | 0.58 | 0.29 | 0.00 | 0.00 | 0.00 |
2017.019 | 0.00 | 0.30 | 0.62 | 0.07 | 0.00 | 0.00 |
2017.123 | 0.00 | 0.11 | 0.66 | 0.23 | 0.00 | 0.00 |
Apply B-splines to some variables using step_spline_b()
03:00
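One possible answer, as a sketch: expand a couple of numeric predictors with B-splines. deg_free = 6 matches the six spline features shown above; the choice of predictors is just an illustration.

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_spline_b(arrival_date_num, lead_time, deg_free = 6) |>
  prep() |>
  bake(new_data = NULL) |>
  select(starts_with("arrival_date_num"), starts_with("lead_time"))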
hotel_train |>
  count(agent)
#> # A tibble: 119 × 2
#> agent n
#> <fct> <int>
#> 1 aaron_marquez 2
#> 2 alexander_drake 1117
#> 3 allen_her 1
#> 4 anas_el_bashir 1
#> 5 araseli_billy 1
#> 6 arhab_al_islam 7
#> 7 audray_tucker 38
#> 8 bernice_baltierra 35
#> 9 betzy_rodriguez 66
#> 10 brayan_guerrero 2
#> # ℹ 109 more rows
We could:
Make the full set of indicator variables 😳
Lump agents and companies that rarely occur into an “other” group
Use feature hashing to create a smaller set of indicator variables
Use target encoding to replace the country, agent, and company columns with the estimated effect of that predictor
Target encoding (also called mean encoding, likelihood encoding, impact encoding, or effect encoding) is a supervised method that turns a single categorical predictor into a single numeric predictor.
It is often used to deal with categorical predictors with many levels, although it works regardless of the number of levels.
Since it uses the outcome during training, you need to use cross-validation to avoid overfitting.
Suppose you have a numeric outcome and a categorical predictor, and you want to transform each level of the categorical predictor into a value that best represents the outcome. The simplest version: calculate the mean of the outcome within each level of the predictor, and use that as the new value.
Caution
Don’t do only this! We are building up the method one piece at a time; unregularized target encoding is very prone to overfitting.
hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    .by = agent
  )
#> # A tibble: 119 × 2
#> agent mean
#> <fct> <dbl>
#> 1 alexander_drake 144.
#> 2 kaylae_maxedon 62.5
#> 3 michael_mcdole 60.9
#> 4 devin_rivera_borrego 126.
#> 5 james_richards 78.6
#> 6 estela_bonilla 41.9
#> 7 charles_najera 109.
#> 8 reema_el_tamer 118.
#> 9 jawhara_al_azad 90.1
#> 10 not_applicable 84.1
#> # ℹ 109 more rows
Caution
Normally you should not look at the testing data set; we do it here for educational purposes only.
hotel_train |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#> agent n
#> <fct> <int>
#> 1 aaron_marquez 2
#> 2 aayaat_al_farran 0
#> 3 alanah_cook 0
#> 4 alexander_drake 1117
#> 5 allen_her 1
#> 6 amirah_christian 0
#> 7 anas_el_bashir 1
#> 8 anna_beltran_moreno 0
#> 9 anna_choi 0
#> 10 araseli_billy 1
#> # ℹ 164 more rows
hotel_test |>
  count(agent, .drop = FALSE)
#> # A tibble: 174 × 2
#> agent n
#> <fct> <int>
#> 1 aaron_marquez 1
#> 2 aayaat_al_farran 0
#> 3 alanah_cook 0
#> 4 alexander_drake 367
#> 5 allen_her 1
#> 6 amirah_christian 0
#> 7 anas_el_bashir 0
#> 8 anna_beltran_moreno 0
#> 9 anna_choi 0
#> 10 araseli_billy 1
#> # ℹ 164 more rows
Calculate the global mean of the outcome and use it for cases that aren’t seen in the training data set.
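In code, that fallback is simply:

# fallback for agents never seen in training
mean(hotel_train$avg_price_per_room)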
Some of the levels have very low counts, so we can’t have the same confidence in those means as in means calculated from high counts. We already use the global mean for levels with zero occurrences; let us also adjust each calculated mean toward the global mean, depending on the counts.
hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    n = n(),
    .by = agent
  ) |>
  arrange(agent)
#> # A tibble: 119 × 3
#> agent mean n
#> <fct> <dbl> <int>
#> 1 aaron_marquez 118. 2
#> 2 alexander_drake 144. 1117
#> 3 allen_her 65 1
#> 4 anas_el_bashir 99 1
#> 5 araseli_billy 40 1
#> 6 arhab_al_islam 35 7
#> 7 audray_tucker 76.0 38
#> 8 bernice_baltierra 71.3 35
#> 9 betzy_rodriguez 84.0 66
#> 10 brayan_guerrero 37.5 2
#> # ℹ 109 more rows
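One common way to do this adjustment, shown as a sketch for intuition rather than the exact computation the recipe step performs: shrink each per-agent mean toward the global mean using a pseudo-count m (the value 20 is made up).

m <- 20 # hypothetical smoothing strength; larger means more shrinkage
global_mean <- mean(hotel_train$avg_price_per_room)

hotel_train |>
  summarise(
    mean = mean(avg_price_per_room),
    n = n(),
    .by = agent
  ) |>
  mutate(smoothed = (n * mean + m * global_mean) / (n + m))

Levels with large n keep values close to their own mean; levels with small n are pulled toward the global mean.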
Partial pooling somewhat lowers the risk of overfitting since it tends to correct for agents with small sample sizes. It can’t correct for improper data usage or data leakage, though.
We have described this method solely based on analytical calculations (step_lencode()), but you could arrive at similar numbers using a model-based approach by fitting a no-intercept generalized linear model. A hierarchical version would induce partial pooling.
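A quick sketch of that model-based view: the coefficients of a no-intercept Gaussian model are exactly the per-agent means, and a mixed-effects model gives the partially pooled version.

# one coefficient per agent, equal to that agent's mean
glm_fit <- glm(avg_price_per_room ~ 0 + agent, data = hotel_train)
head(coef(glm_fit))

# a hierarchical version that induces partial pooling:
# lme4::lmer(avg_price_per_room ~ 1 + (1 | agent), data = hotel_train)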
recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_lencode(
    agent, country, company,
    outcome = vars(avg_price_per_room),
    smooth = TRUE
  ) |>
  prep() |>
  bake(new_data = NULL) |>
  select(agent, country, company)
#> # A tibble: 11,551 × 3
#> agent country company
#> <dbl> <dbl> <dbl>
#> 1 144. 108. 109.
#> 2 77.2 83.4 109.
#> 3 61.5 99.9 109.
#> 4 126. 108. 109.
#> 5 144. 108. 109.
#> 6 79.9 73.8 109.
#> 7 126. 99.9 109.
#> 8 69.8 108. 109.
#> 9 126. 99.9 109.
#> 10 144. 108. 109.
#> # ℹ 11,541 more rows
Apply target encoding to the data set and see how it affects different predictors, not just the ones we listed here.
step_lencode()
step_lencode_glm()
step_lencode_bayes()
step_lencode_mixed()
03:00
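One possible exploration, mirroring the earlier step_lencode() call on a different pair of predictors:

recipe(avg_price_per_room ~ ., data = hotel_train) |>
  step_lencode(
    market_segment, customer_type,
    outcome = vars(avg_price_per_room),
    smooth = TRUE
  ) |>
  prep() |>
  bake(new_data = NULL) |>
  select(market_segment, customer_type)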
How can we represent the date column arrival_date for our model?
When we use a date column in its native format, most models in R convert it to an integer.
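For example, R stores a Date as the number of days since 1970-01-01:

as.numeric(as.Date("2016-07-02"))
#> [1] 16984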
We can instead re-engineer it as a set of date-based features.
Explore the arrival_date variable and its relation to avg_price_per_room. The lubridate package might prove helpful.
05:00
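A minimal sketch of one way to explore it (ggplot2 is attached with tidymodels; lubridate is not):

library(lubridate)

hotel_train |>
  mutate(dow = wday(arrival_date, label = TRUE)) |>
  ggplot(aes(arrival_date, avg_price_per_room, color = dow)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE)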
arrival_date date features, using step_date(features = c("year", "month", "dow", "decimal", "mday", "doy", "week", "semester", "quarter")):
arrival_date | year | month | dow | decimal | mday | doy | week | semester | quarter |
---|---|---|---|---|---|---|---|---|---|
2016-08-30 | 2016 | Aug | Tue | 2016.661 | 30 | 243 | 35 | 2 | 3 |
2016-10-22 | 2016 | Oct | Sat | 2016.806 | 22 | 296 | 43 | 2 | 4 |
2016-12-17 | 2016 | Dec | Sat | 2016.959 | 17 | 352 | 51 | 2 | 4 |
2017-02-13 | 2017 | Feb | Mon | 2017.118 | 13 | 44 | 7 | 1 | 1 |
2017-04-05 | 2017 | Apr | Wed | 2017.258 | 5 | 95 | 14 | 1 | 2 |
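In recipe form, a call like this produces the features above:

recipe(avg_price_per_room ~ arrival_date, data = hotel_train) |>
  step_date(
    arrival_date,
    features = c(
      "year", "month", "dow", "decimal", "mday",
      "doy", "week", "semester", "quarter"
    )
  ) |>
  prep() |>
  bake(new_data = NULL)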
The same arrival_date date features, adding label = FALSE to get numeric months and days of the week:
arrival_date | year | month | dow | decimal | mday | doy | week | semester | quarter |
---|---|---|---|---|---|---|---|---|---|
2016-08-30 | 2016 | 8 | 3 | 2016.661 | 30 | 243 | 35 | 2 | 3 |
2016-10-22 | 2016 | 10 | 7 | 2016.806 | 22 | 296 | 43 | 2 | 4 |
2016-12-17 | 2016 | 12 | 7 | 2016.959 | 17 | 352 | 51 | 2 | 4 |
2017-02-13 | 2017 | 2 | 2 | 2017.118 | 13 | 44 | 7 | 1 | 1 |
2017-04-05 | 2017 | 4 | 4 | 2017.258 | 5 | 95 | 14 | 1 | 2 |
step_time() works the same as step_date() but for measurements smaller than a day: hour, hour12, am/pm, minute, second, and decimal_day.
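hotel_rates has no time-of-day column, so here is a minimal sketch with a made-up stamp column:

library(lubridate)

times <- tibble(stamp = as.POSIXct("2017-04-05 14:30:00", tz = "UTC") + hours(0:3))

recipe(~ stamp, data = times) |>
  step_time(stamp, features = c("hour", "minute")) |>
  prep() |>
  bake(new_data = NULL)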
step_holiday() adds indicators for holidays; see timeDate::listHolidays() for the supported holidays.
Apply date steps to the arrival_date variable and see whether they capture anything about avg_price_per_room
03:00
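One possible answer, combining date features with holiday indicators (the feature choices are illustrative):

recipe(avg_price_per_room ~ arrival_date, data = hotel_train) |>
  step_date(arrival_date, features = c("month", "dow", "doy")) |>
  step_holiday(arrival_date) |>
  prep() |>
  bake(new_data = NULL)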
The numeric features make it easy to capture the end or beginning of a period, but harder to capture anything more granular. The indicators mostly care about the day itself, with no information about the lead-up or the aftermath.
Using extrasteps::step_time_event() and the almanac package, we can create more useful time features: build a list of almanac recurrence rules and pass them to the rules argument of extrasteps::step_time_event(), as sketched below.
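A hedged sketch, assuming rules accepts a named list of almanac rules; the Christmas rule is just an illustration.

library(almanac)

on_christmas <- yearly() |>
  recur_on_month_of_year("Dec") |>
  recur_on_day_of_month(25)

recipe(avg_price_per_room ~ arrival_date, data = hotel_train) |>
  extrasteps::step_time_event(arrival_date, rules = list(christmas = on_christmas)) |>
  prep() |>
  bake(new_data = NULL)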
Even when step_time_event() returns numerics, these features still have the issue that they only attach value to the date itself.
We can attach values based on how far we are away from those dates.
The extrasteps package provides step_date_before(), step_date_after(), and step_date_nearest() for this; each can also be used in an inverse form, where days closer to the event receive larger values.

Avoid crafting datetime features by hand if at all possible.
Dealing with uneven month lengths, leap days, and leap seconds is hard enough, and it gets worse if you have ever tried to define an event that doesn’t land on the same day of the week or the same date each year, such as “the first Sunday after the first full moon on or after the vernal equinox” (Easter).
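Recurrence rules handle exactly this kind of definition. A hedged sketch using almanac (the fourth-Thursday-of-November rule is an illustration, and recur_on_day_of_week() with nth is assumed from the almanac API):

library(almanac)

on_fourth_thursday_nov <- yearly() |>
  recur_on_month_of_year("Nov") |>
  recur_on_day_of_week("Thursday", nth = 4)

alma_search(as.Date("2016-01-01"), as.Date("2017-12-31"), on_fourth_thursday_nov)
# e.g. 2016-11-24 and 2017-11-23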