Getting More Out of Feature Engineering and Tuning for Machine Learning
Getting set up
library(tidymodels)
library(embed)
library(extrasteps)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

# Load our example data for this section
"https://github.com/tidymodels/workshops/raw/refs/heads/2025-GMOFETML/slides/leaf_data.RData" |>
  url() |>
  load()
This is used to combine infrequent levels together.
We have a step, step_other(), that does this for nominal variables, but it doesn't work in these cases because the collapsing has to happen after the extraction/combination.
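For reference, this is how step_other() normally collapses infrequent levels of nominal predictors (a minimal sketch; the data set and threshold are illustrative, not from the workshop code):

# Collapse levels of cyl that appear in fewer than 25% of rows into "other"
dat <- mtcars |> dplyr::mutate(cyl = factor(cyl))
rec <- recipe(~ ., data = dat) |>
  step_other(cyl, threshold = 0.25)
prep(rec) |> bake(new_data = NULL)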
While keeping our models in mind, we want to make sure the data is well-suited for them.
Correlated data: hard for some models
lat/lon compared to distance/angle: hard for most models
Ineffective Representation
Computational Speed
Depending on which method we use and how the data is affected by it, we could see a large reduction in features. This, in turn, leads to a smaller model that is faster to train.
Only exploration and trial and error can determine whether you should use dimensionality reduction techniques. Knowing which methods do what helps you determine what to try.
Dimensionality Reduction Methods
Zero-variance removal
PCA
Truncated PCA
Sparse PCA
NNMF
UMAP
Isomap
Restrictions
None of the methods shown today can handle:
Missing data
Non-numeric data
A recipe pattern that deals with both restrictions is sketched below.
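A minimal sketch (the imputation method, dummy encoding, and mtcars-based example data are illustrative assumptions, not the workshop's choices): impute and encode before the reduction step.

# Example data with a factor column and a missing value
dat <- mtcars |> dplyr::mutate(cyl = factor(cyl))
dat$disp[1] <- NA

rec <- recipe(~ ., data = dat) |>
  step_impute_mean(all_numeric_predictors()) |>  # handle missing data
  step_dummy(all_nominal_predictors()) |>        # handle non-numeric data
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)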
Why not t-SNE?
One of the main requirements for a feature engineering method is that the transformation trained on the training data set can be reapplied to the testing data set.
This is not possible with t-SNE as it is an iterative method that shifts observations in the lower-dimensional space based on their distances to points in the higher-dimensional space.
It doesn’t create a mapping that can be reused.
PCA
Principal Component Analysis constructs linear combinations of the original variables such that most of the variation is captured in the first combination, then the second, then the third, and so on.
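In standard PCA notation (added here for context, following the usual textbook definition; \(x_{ij}\) is the value of variable \(j\) for observation \(i\), assumed centered), the first principal component is

\[ z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}, \qquad \text{subject to } \sum_{j=1}^{p} \phi_{j1}^2 = 1. \]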
We are, in essence, maximizing the sample variance of the \(n\) values of \(z_{i1}\).
We refer to \(z_{11}, ..., z_{n1}\) as the scores of the first principal component.
PCA Algorithm
Luckily, this problem can be solved with techniques from linear algebra; more specifically, with an eigendecomposition of the covariance matrix.
One of the main strengths of PCA is that you don't need iterative optimization to get the results: they are exact, with no approximations.
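A minimal base-R sketch of this idea (mtcars stands in for the real data; this is not the workshop code):

# Center and scale the data, then eigendecompose its covariance matrix
X <- scale(as.matrix(mtcars))
eig <- eigen(cov(X))
# Columns of eig$vectors hold the loadings; project to get the scores
scores <- X %*% eig$vectors
# Proportion of variance explained by each component
eig$values / sum(eig$values)

Up to sign, these scores match what prcomp(mtcars, scale. = TRUE) returns.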
Once the first principal component is calculated, we can calculate the second principal component.
We find the second principal component \(Z_2\) as the linear combination of \(X_1, ..., X_p\) that has maximal variance among the linear combinations that are uncorrelated with \(Z_1\). This is the same as saying that the direction \(\phi_2\) should be orthogonal to the direction \(\phi_1\).
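In the same notation as before (again standard textbook notation, added for context), the constraints read

\[ \sum_{j=1}^{p} \phi_{j2}^2 = 1 \qquad \text{and} \qquad \sum_{j=1}^{p} \phi_{j2}\,\phi_{j1} = 0. \]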
How is that a dimensionality reduction method?
By itself, it isn’t, as it rotates all the features in the feature space.
It becomes a dimensionality reduction method if we only calculate some of the principal components.
This is typically done by retaining a fixed number of components or by setting a threshold on the total variance explained; both options are sketched below.
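In recipes, step_pca() supports both options (a minimal sketch; mtcars and the specific values are illustrative):

# Keep a fixed number of components
rec_fixed <- recipe(~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# Keep as many components as needed to explain 90% of the variance
rec_thresh <- recipe(~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), threshold = 0.9)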
NNMF
Non-Negative Matrix Factorization is conceptually similar to PCA, but it has different objectives.
PCA aims to generate uncorrelated components that maximize the variance, one component at a time.
NNMF, on the other hand, optimizes all the components simultaneously, under the constraint that all the loadings are non-negative; the data itself must also be non-negative.
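One way to try this in a recipe (a sketch under assumptions: step_nnmf_sparse() from the recipes package, which requires RcppML, with mtcars as a stand-in since all of its values happen to be non-negative):

# Two non-negative components from non-negative predictors
rec_nnmf <- recipe(~ ., data = mtcars) |>
  step_nnmf_sparse(all_numeric_predictors(), num_comp = 2)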
Isomap
Isometric mapping is a non-linear dimensionality reduction method.
This is another method that uses distances between points to produce a graph of neighboring points. Where it differs from other methods is that it uses geodesic distances rather than straight-line distances.
The geodesic distance is the sum of edge weights along the shortest path between two points.
The eigenvectors of the geodesic distance matrix are then used to represent the new coordinates.
Isomap Algorithm
A very high-level description of the Isomap algorithm is given below; a recipe sketch follows the steps.
1. Find the neighbors for each point
2. Construct the neighborhood graph, using Euclidean distance as edge length
3. Calculate the shortest path between each pair of points
4. Use multidimensional scaling (MDS) to compute a lower-dimensional embedding
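In recipe form (a sketch, not the workshop code; step_isomap() from recipes requires the dimRed and RSpectra packages, and mtcars and the parameter values are illustrative):

rec_isomap <- recipe(~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_isomap(all_numeric_predictors(), num_terms = 2, neighbors = 10)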
Isomap Pros and Cons
Pros
Captures non-linear effects
Captures long-range structure, not just local structure