taxi_train
#> # A tibble: 8,000 × 7
#>    tip   distance company                      local dow   month  hour
#>    <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
#>  1 yes      17.2  Chicago Independents         no    Thu   Feb      16
#>  2 yes       0.88 City Service                 yes   Thu   Mar       8
#>  3 yes      18.1  other                        no    Mon   Feb      18
#>  4 yes      12.2  Chicago Independents         no    Sun   Mar      21
#>  5 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
#>  6 yes      17.5  Flash Cab                    no    Fri   Mar      12
#>  7 yes      17.7  other                        no    Sun   Jan       6
#>  8 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
#>  9 yes       0.53 Sun Taxi                     no    Tue   Mar      18
#> 10 yes       6.65 Taxicab Insurance Agency Llc no    Sun   Apr      11
#> # ℹ 7,990 more rows
Working with other models
Some models can’t handle non-numeric data
Linear Regression
K Nearest Neighbors
Some models struggle if numeric predictors aren’t scaled
K Nearest Neighbors
Anything using gradient descent
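Both needs above can be addressed in one recipe. As a minimal sketch (using a small toy data frame standing in for `taxi_train`, since only a few columns are needed to illustrate):

```r
library(recipes)

# Toy stand-in for the taxi training data (assumed column names).
toy_taxi <- data.frame(
  tip      = factor(c("yes", "no", "yes", "no")),
  distance = c(17.2, 0.88, 18.1, 12.2),
  company  = factor(c("Flash Cab", "Sun Taxi", "other", "Sun Taxi"))
)

# step_dummy() gives models like linear regression the numeric
# encoding they need; step_normalize() centers and scales numeric
# predictors for models like K nearest neighbors.
rec <- recipe(tip ~ ., data = toy_taxi) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
```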
Types of needed preprocessing
Do qualitative predictors require a numeric encoding?
Should columns with a single unique value be removed?
Does the model struggle with missing data?
Does the model struggle with correlated predictors?
Should predictors be centered and scaled?
Is it helpful to transform predictors to be more symmetric?
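Each question in this checklist maps to a recipes step. A sketch, using a tiny hypothetical data set (column names and step order are illustrative choices, not prescribed by the source):

```r
library(recipes)

# Tiny hypothetical data set exercising each question above.
dat <- data.frame(
  y  = c(1.2, 3.4, 2.2, 5.1),
  x1 = c(10, NA, 30, 40),               # has missing data
  x2 = factor(c("a", "b", "a", "b")),   # qualitative predictor
  x3 = c(1, 1, 1, 1),                   # single unique value
  x4 = c(2.1, 4.4, 6.0, 8.3)
)

rec <- recipe(y ~ ., data = dat) |>
  step_impute_median(all_numeric_predictors()) |> # missing data
  step_zv(all_predictors())                    |> # single-value columns
  step_YeoJohnson(all_numeric_predictors())    |> # more symmetric
  step_dummy(all_nominal_predictors())         |> # numeric encoding
  step_corr(all_numeric_predictors())          |> # correlated predictors
  step_normalize(all_numeric_predictors())        # center and scale
```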
Two types of preprocessing
General definitions
Data preprocessing comprises the steps you take to make your model successful.
Feature engineering is what you do to the original predictors so that the model does the least work to perform well.
Working with dates
Datetime variables are automatically converted to integers if given as raw predictors. To avoid this, they can be re-encoded as:
Days since a reference date
Day of the week
Month
Year
Leap year
Indicators for holidays
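Several of these encodings can be produced with `step_date()` and `step_holiday()`. A sketch, assuming a hypothetical `pickup` datetime column:

```r
library(recipes)

# Toy data with a raw datetime predictor (hypothetical column names).
toy <- data.frame(
  fare   = c(12.5, 7.2, 30.1),
  pickup = as.POSIXct(c("2022-02-17 16:00", "2022-03-03 08:30",
                        "2022-04-23 23:10"))
)

rec <- recipe(fare ~ pickup, data = toy) |>
  # extract day of week, month, and year as new columns
  step_date(pickup, features = c("dow", "month", "year")) |>
  # indicator columns for selected holidays
  step_holiday(pickup, holidays = c("USThanksgivingDay", "ChristmasDay"))
```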
Your turn
What other transformations could we do with the raw time variable?
Remember that the transformations are tied to the specific modeling problem.
03:00
Two types of transformations
Static
Square root, log, inverse
Dummies for known levels
Date time extractions
Trained
Centering & scaling
Imputation
PCA
Anything for unknown factor levels
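The distinction can be seen in a small sketch: a static step like `step_log()` applies the same math to any data set, while a trained step like `step_normalize()` must first estimate statistics from the training data (toy data assumed here):

```r
library(recipes)

toy <- data.frame(y = c(1.1, 2.3, 0.7, 5.2), x = c(10, 20, 30, 40))

rec <- recipe(y ~ x, data = toy) |>
  step_log(x) |>       # static: same transformation on any data set
  step_normalize(x)    # trained: mean and sd estimated from training data

# After prep(), the trained step holds the estimated statistics,
# which tidy() can display:
trained <- prep(rec, training = toy)
tidy(trained, number = 2)
```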
Trained methods need to calculate and store sufficient information so they can be applied again to new data.
Every preprocessing step in a recipe that involves calculations uses the training set. For example:
Levels of a factor
Determination of zero-variance
Normalization
Feature extraction
Once a recipe is added to a workflow, this occurs when fit() is called.
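As a sketch of that workflow pattern (toy data and a logistic regression model are assumptions standing in for the workshop's setup):

```r
library(parsnip)
library(recipes)
library(workflows)

# Toy stand-in for taxi_train.
toy_taxi <- data.frame(
  tip      = factor(rep(c("yes", "no"), 10)),
  distance = seq(0.5, 20, length.out = 20)
)

rec <- recipe(tip ~ distance, data = toy_taxi) |>
  step_normalize(all_numeric_predictors())

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg())

# fit() estimates the recipe on the training set and then
# trains the model on the preprocessed data in one call.
wf_fit <- fit(wf, data = toy_taxi)
```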
Debugging a recipe
Typically, you will want to use a workflow to estimate and apply a recipe.
If you have an error and need to debug your recipe, the original recipe object (e.g. taxi_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
Another function, bake(), is analogous to predict(), and gives you the processed data back.
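A minimal prep/bake sketch, using toy data in place of the taxi training set:

```r
library(recipes)

toy_taxi <- data.frame(
  tip      = factor(c("yes", "no", "yes", "no")),
  distance = c(17.2, 0.88, 18.1, 12.2)
)

rec <- recipe(tip ~ distance, data = toy_taxi) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the recipe (analogous to fit());
# bake() applies it and returns the processed data (analogous
# to predict()). new_data = NULL returns the processed training set.
rec_trained <- prep(rec, training = toy_taxi)
processed   <- bake(rec_trained, new_data = NULL)
```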
Your turn
Take the recipe and prep() then bake() it to see what the resulting data set looks like.
Try removing steps to see how the result changes.
05:00
Printing a recipe
taxi_rec
#>
#> ── Recipe ────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 6
#>
#> ── Operations
#> • Unknown factor level assignment for: all_nominal_predictors()
#> • Dummy variables from: all_nominal_predictors()
#> • Zero variance filter on: all_predictors()
#> • Log transformation on: distance
#> • Centering and scaling for: all_numeric_predictors()
Prepping a recipe
prep(taxi_rec)
#>
#> ── Recipe ────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 6
#>
#> ── Training information
#> Training data contained 8000 data points and no incomplete rows.
#>
#> ── Operations
#> • Unknown factor level assignment for: company and local, ... | Trained
#> • Dummy variables from: company, local, dow, month | Trained
#> • Zero variance filter removed: company_unknown, ... | Trained
#> • Log transformation on: distance | Trained
#> • Centering and scaling for: distance and hour, ... | Trained