Feature Engineering

Feature engineering involves creating new features or transforming existing ones to improve the performance of models. Well-engineered features tend to give models a bigger performance boost than algorithmic techniques such as hyperparameter tuning. This chapter explores the types of features, common feature engineering operations, strategies to avoid data leakage, and best practices for creating robust and generalizable features.

Learned Features Versus Engineered Features

In machine learning, features can be broadly categorized into two types: learned features and engineered features.

  • Learned Features: These are features automatically extracted by models, typically deep learning models, during the training process. Convolutional Neural Networks (CNNs) for image data and Large Language Models (LLMs) for text are examples where features are learned from the raw input data. However, not all features can be automatically learned.
  • Engineered Features: These are features created manually based on domain-specific knowledge and insights into the data. Feature engineering involves transforming raw data into meaningful inputs that can improve the performance of machine learning algorithms.

While learned features can capture complex patterns directly from raw data, engineered features leverage human expertise and can often lead to more interpretable models.

Common Feature Engineering Operations

This section presents some of the most common feature engineering operations, but it is by no means comprehensive.

Handling Missing Values

Missing values are a common issue in datasets and need to be addressed, but not all missing values are equal:

  • Missing not at random (MNAR): This is when a value is missing because of the true value itself. For instance, respondents with higher incomes might be less willing to disclose them.
  • Missing at random (MAR): This is when a value is missing not because of the value itself, but because of another observed variable. For instance, respondents of gender A might be less willing to disclose their age.
  • Missing completely at random (MCAR): This is when there's no pattern in when the value is missing.

There are two primary methods to handle missing values: deletion and imputation.

Deletion

  • Column deletion: Remove features that have a high rate of missing values. The drawback is that you may be removing important information and reducing the model's accuracy.
  • Row deletion: Remove samples that have missing value(s). This method works best when the missing values are missing completely at random (MCAR) and the number of examples with missing values is small.

Imputation

  • Default value: Fill in missing values with a default value. For example, if the job field is missing, you might fill it with an empty string "".
  • Mean, Median or Mode Imputation: Replace missing values with the mean, median, or mode (most frequent value) of the non-missing values. This method is simple but can introduce bias if the data is not normally distributed.
  • Advanced Techniques: Use more sophisticated models like interpolation, regression, iterative imputation, or machine learning algorithms (such as KNN) to predict and fill in missing values.

In general, you want to avoid filling missing values with values that could plausibly occur, such as filling a missing number of children with 0. Doing so makes it hard to distinguish between people whose information is missing and people who don't have children.
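
As a concrete sketch (using toy data and scikit-learn's SimpleImputer), here is what median imputation might look like. Note that the statistics are learned from the training split only and reused on the test split, a point the data leakage section returns to.

```python
# A minimal sketch of median imputation with scikit-learn; the arrays are toy data.
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[25.0], [32.0], [np.nan], [41.0]])
X_test = np.array([[np.nan], [29.0]])

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # learns the median (32.0) from train
X_test_filled = imputer.transform(X_test)        # reuses the train median on test
```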

Scaling

Scaling transforms features to a common scale without distorting differences in the range of values. This is particularly important for algorithms sensitive to the scale of input data, such as Support Vector Machines (SVM), k-nearest neighbors, and logistic regression.

  • Min-Max Scaling: Scale features to a fixed range, typically $[0, 1]$, but you can define an arbitrary range $[a, b]$ (e.g., $[-1, 1]$).

    $$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad \text{or} \qquad \tilde{x} = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$
  • Standardization: Scale features to have a mean of 0 and a standard deviation of 1 by subtracting the mean $\bar{x}$ and dividing by the standard deviation $\sigma$. If you think your variable might follow a normal distribution, it can be helpful to normalize it to zero mean and unit variance.

    $$\tilde{x} = \frac{x - \bar{x}}{\sigma}$$
  • Log Scaling: This method is useful for transforming skewed data to reduce the impact of outliers and make the distribution more normal. It involves taking the logarithm of the feature values.

    $$\tilde{x} = \log(x) \qquad \text{or} \qquad \tilde{x} = \log(x + c)$$

    where $c$ is a constant added to ensure all values are positive before applying the logarithm.
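
As a sketch of these operations in practice, here is min-max scaling and standardization with scikit-learn on toy data. The scalers are fit on the training data only, and the learned statistics are then applied unchanged to new data, which matters for the notes below.

```python
# A minimal sketch of min-max scaling and standardization with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [5.0], [10.0]])
X_new = np.array([[7.0]])

minmax = MinMaxScaler(feature_range=(0, 1)).fit(X_train)  # learns min/max from train
standard = StandardScaler().fit(X_train)                  # learns mean/std from train

print(minmax.transform(X_new))    # (7 - 1) / (10 - 1) ≈ 0.667
print(standard.transform(X_new))  # (7 - mean) / std, using training statistics
```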

Attention
  1. Scaling is a major source of data leakage (covered in the next section).
  2. Scaling requires global statistics, calculated on the training data and saved for use at test and inference time. If new data has shifted significantly from the training data, these statistics won't be very helpful, so it's important to retrain your model often to account for these changes.

Discretization

Discretization (or quantization) transforms continuous features into discrete buckets or bins. This can help models learn simpler representations and make them more interpretable. By converting continuous variables into discrete categories, we can enable models to treat similar feature values uniformly, potentially improving performance on certain tasks.
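
For illustration, here is a minimal sketch of bucketing a continuous income feature with pandas; the values, bin edges, and labels are arbitrary choices.

```python
# A minimal sketch of discretization (binning) with pandas; toy data.
import pandas as pd

income = pd.Series([8_000, 35_000, 62_000, 120_000])
buckets = pd.cut(
    income,
    bins=[0, 34_999, 99_999, float("inf")],
    labels=["low", "middle", "high"],
)
print(buckets.tolist())  # ['low', 'middle', 'middle', 'high']
```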

Encoding Categorical Features

Categorical features must be encoded into numerical values before being used in machine learning models. One challenge is that categories are not always static. For instance, if you treat product brands as categories, new brands can emerge that your model didn't encounter during training. You can create an “UNKNOWN” category in your training data to prevent the model from crashing, but this approach treats all unseen brands, whether luxurious or sketchy, the same.

The Hashing Trick

One solution for handling dynamic categories is the hashing trick. A hash function generates a hashed value for each category, which becomes the index for that category. By specifying the hash space, you can fix the number of encoded values for a feature in advance, without needing to know the exact number of categories.

One problem with hash functions is collision: two different categories may be hashed to the same index. However, collisions are random and spread across the hash space. According to research done by Booking.com, even with 50% of features colliding, the performance loss was less than 0.5%.
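
Here is a minimal sketch of the hashing trick in plain Python. The hash space of 1,000 slots is an arbitrary choice, and production systems often prefer fast non-cryptographic hashes (e.g., MurmurHash, as used by scikit-learn's FeatureHasher); MD5 is used here only for brevity.

```python
# A minimal sketch of the hashing trick; the hash space size is arbitrary.
import hashlib

HASH_SPACE = 1_000

def hash_feature(category: str) -> int:
    """Map a category string to a stable index in [0, HASH_SPACE)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE

print(hash_feature("brand_a"))       # same index every run, no vocabulary needed
print(hash_feature("unseen_brand"))  # new categories still get a valid index
```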

Feature Crossing

Feature crossing creates new features by combining existing ones to capture interactions between features. This can help models learn non-linear relationships in the data. It's most helpful for models that are poor at learning, or cannot learn, non-linear relationships, such as linear regression, logistic regression, and tree-based models. It's less helpful for neural networks, but it can still help them learn faster. DeepFM and xDeepFM are families of models that have successfully leveraged explicit feature interactions for recommender systems and click-through rate prediction.
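
As a sketch, here is a simple feature cross in pandas; the columns (marital status and number of children) are hypothetical.

```python
# A minimal sketch of feature crossing with pandas; columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["single", "married", "married"],
    "num_children": [0, 2, 0],
})

# The crossed feature exposes the interaction between the two columns,
# which a linear model could not learn from the raw features alone.
df["marital_x_children"] = df["marital_status"] + "_" + df["num_children"].astype(str)
print(df["marital_x_children"].tolist())  # ['single_0', 'married_2', 'married_0']
```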

Discrete and Continuous Positional Embeddings

Embedding

An embedding represents a piece of data as a vector. Word embeddings, for instance, map words to vectors in a continuous space. Similarly, positional embeddings map the position of each token in a sequence to a vector.

Introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017), positional embeddings are essential for tasks in NLP and computer vision. They help models understand the order of inputs.

In models like transformers, words are processed in parallel, so positional information must be explicitly provided. We avoid feeding absolute positions (0, 1, 2, …) directly because neural networks tend to perform poorly with unbounded, unscaled integer inputs.

Learned Position Embeddings

One way to handle position embeddings is to treat them the way we'd treat word embeddings: use an embedding matrix in which each position gets an embedding that is learned during training.
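
A minimal sketch with PyTorch, assuming a maximum sequence length of 512 and an embedding size of 64 (both arbitrary):

```python
# A minimal sketch of learned position embeddings with PyTorch.
import torch
import torch.nn as nn

max_len, d_model = 512, 64
pos_embedding = nn.Embedding(max_len, d_model)  # weights are learned during training

positions = torch.arange(10)            # positions 0..9 for a 10-token input
pos_vectors = pos_embedding(positions)  # shape (10, 64); added to token embeddings
```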

Fixed Position Embeddings

Another way to handle position embeddings is to use predefined functions, typically sine and cosine, to encode positions. This method, from the original Transformer paper, ensures positional embeddings capture relative positions effectively.
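
As a sketch, here is the sine/cosine scheme from the original Transformer paper in NumPy: even embedding dimensions use sine and odd dimensions use cosine, with wavelengths forming a geometric progression.

```python
# A minimal sketch of fixed sinusoidal position embeddings (Vaswani et al., 2017).
import numpy as np

def sinusoidal_embeddings(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = positions / np.power(10_000, dims / d_model)
    emb = np.zeros((seq_len, d_model))
    emb[:, 0::2] = np.sin(angles)             # even dimensions: sine
    emb[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return emb

print(sinusoidal_embeddings(seq_len=4, d_model=8).shape)  # (4, 8)
```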

Fixed embeddings can be extended to continuous spaces using Fourier features, which are effective for tasks involving coordinates (or positions).

Data Leakage

Data leakage occurs when information (e.g., the label) “leaks” from outside the training dataset into the model-building process, leading to overly optimistic performance estimates during training but poor performance in production. It can happen because of how data is collected or handled, or even due to the innate origin of the data (e.g., hospital A always sends patients suspected of having lung cancer to a specific CT scan machine that outputs slightly different images).

Common Causes for Data Leakage

  • Splitting time-correlated data randomly instead of by time: When working with time-series data (or time-correlated data, e.g., the time the data is generated affects its label distribution), splitting the data randomly can cause the model to learn from future information. Always split by time to ensure the model only has access to past information during training.
  • Scaling before splitting: Performing scaling operations on the entire dataset before splitting can leak information from the test set into the training set. Scale the training data independently and apply the same transformation to the test set.
  • Filling in missing data with statistics from the test split: Imputing missing values using statistics (e.g., mean, median) computed from the entire dataset can lead to leakage. Compute statistics only from the training data and use them to fill missing values in both training and test sets.
  • Poor handling of data duplication before splitting: Duplicates in the dataset can cause leakage if not handled properly. Remove duplicates before splitting the data to ensure that no information from the test set influences the training process.
  • Group leakage: When samples are grouped together, information from one sample can inadvertently leak into another if the split is done incorrectly. For example, in an object detection task, photos taken milliseconds apart may land in different splits. Ensure that groups of related samples are kept entirely within either the training or the test set (see the sketch after this list).
  • Leakage from the data generation process: If the process used to generate the data unintentionally includes target information, it can cause leakage, as in the CT scan machine example. Review the data generation process to ensure that no target information is inadvertently included in the features, and don't forget to involve subject matter experts, who have more context on how the data is collected and used.
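
As a sketch of the group leakage point, scikit-learn's GroupShuffleSplit keeps each group entirely on one side of the split; the group ids below are hypothetical "photo burst" ids.

```python
# A minimal sketch of a group-aware split with scikit-learn.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(8).reshape(-1, 1)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # photos taken ms apart share a group id

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))
# Every group id lands entirely in train or entirely in test, never both.
```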

Detecting Data Leakage

Investigate the importance of each feature. If a feature has an unusually high importance that doesn't make logical sense, it might be leaking information from the target. Keep an eye on the impact of new features on your model's performance: if performance improves dramatically, either the feature is very good or it contains leakage. Finally, be careful every time you look at the test split, and use it only to report a model's final performance.

Engineering Good Features

Effective feature engineering can significantly enhance the performance and robustness of machine learning models. This involves creating features that are both informative and generalizable while minimizing the number of features needed to train the model. By focusing on fewer, high-quality features, we reduce the risk of data leakage, decrease the likelihood of overfitting, and lower the memory requirements for serving the model. Additionally, this approach diminishes latency in feature extraction, particularly for online processing, and reduces technical debt, making the overall system more efficient and maintainable.

Feature Importance

  • Model-Based Methods: Algorithms like Random Forest, Gradient Boosting, and XGBoost provide built-in feature importance metrics. These methods assess the importance of each feature based on how often it is used to make splits in decision trees or its impact on the loss function (see the sketch after this list).
  • SHAP Values (SHapley Additive exPlanations): SHAP values provide a unified measure of feature importance by considering the contribution of each feature to every prediction made by the model.
  • LIME (Local Interpretable Model-agnostic Explanations): LIME approximates the model locally around a prediction to understand the contribution of each feature.
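
As a sketch of the model-based approach, here is the feature importance of a random forest trained on synthetic data in which only the first feature matters. In a real dataset, an unexpectedly dominant feature is a leakage red flag.

```python
# A minimal sketch of model-based feature importance with scikit-learn; toy data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # three features; only the first is informative
y = (X[:, 0] > 0).astype(int)  # label depends solely on the first feature

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.feature_importances_)  # the first importance should dominate
```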

Feature Generalization

Feature generalization ensures that the features used in the model are not overly specific to the training data and can generalize well to new, unseen data. Overall, there are two aspects to consider regarding generalization: feature coverage and the distribution of feature values.

  • Coverage: Feature coverage refers to the extent to which the features represent the entire input space. Ensuring broad coverage means that the model has seen a diverse range of examples during training, making it more likely to perform well on new data.

  • Distribution: The distribution of feature values should be consistent between the training and test datasets. Features should not only cover the input space broadly but also follow the same statistical properties across different datasets; a simple distribution check is sketched below.
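
As a sketch of such a check, a two-sample Kolmogorov–Smirnov test from SciPy is one simple way to flag a shift in a numeric feature between splits; the significance threshold below is an arbitrary choice.

```python
# A minimal sketch of checking feature coverage and distribution consistency.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1_000)
test_feature = rng.normal(loc=0.5, scale=1.0, size=1_000)  # deliberately shifted

coverage = np.mean(~np.isnan(train_feature))  # fraction of non-missing values
stat, p_value = ks_2samp(train_feature, test_feature)
if p_value < 0.01:  # distributions likely differ between the splits
    print(f"possible distribution shift (KS statistic = {stat:.3f})")
```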