Feature Engineering for Machine Learning (2024)

Welcome to Part 4 of our Data Science Primer. In this guide, we’ll see how we can perform feature engineering to help out our algorithms and improve model performance. Remember, out of all the core steps in applied machine learning, data scientists usually spend the most time on feature engineering.


What is Feature Engineering?

Feature engineering is about creating new input features from your existing ones. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

All data scientists should master the process of engineering new features, for three big reasons:

  1. You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.
  2. You can bring in your own domain expertise.
  3. Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!

In this guide, we will introduce several heuristics to help spark new ideas. Of course, this will not be an exhaustive compendium of all feature engineering, for which there are limitless possibilities. The good news is that this skill will naturally improve as you gain more experience.


Infuse Domain Knowledge

You can often engineer informative features by tapping into your (or others’) expertise about the domain. Try to think of specific information you might want to isolate or put the focus on. Here, you have a lot of “creative freedom” and “skill expression” as a data scientist.

For example, let’s say you’re working on a US real-estate model, using a dataset of historical prices going back to the 2000s. For this scenario, it’s important to remember that the subprime mortgage crisis occurred within that timeframe.


If you suspect that prices would be affected, you could create an indicator variable for transactions during that period. Indicator variables are binary variables that can be either 0 or 1. They “indicate” if an observation meets a certain condition, and they are very useful for isolating key properties.
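Here’s a minimal sketch of what that could look like with pandas. The column names and the 2007-2009 window below are illustrative assumptions, not fields from any particular dataset:

```python
import pandas as pd

# Toy transactions table (hypothetical column names)
df = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2005-06-01', '2008-03-15', '2012-09-30']),
    'sale_price': [250_000, 180_000, 310_000],
})

# Indicator variable: 1 if the sale falls inside the assumed crisis window, else 0
crisis_start, crisis_end = pd.Timestamp('2007-01-01'), pd.Timestamp('2009-12-31')
df['during_crisis'] = df['transaction_date'].between(crisis_start, crisis_end).astype(int)

print(df[['transaction_date', 'during_crisis']])
```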

As you might suspect, “domain knowledge” is very broad and open-ended. At some point, you’ll get stuck or exhaust your ideas. That’s where the next few sections come in: specific heuristics that can help spark more ideas.

Create Interaction Features

The first of these heuristics is checking to see if you can create any interaction features that make sense. These are combinations of two or more features.

By the way, in some contexts, “interaction terms” refer strictly to products of two variables. In our context, interaction features can be products, sums, or differences between two features.

A general tip is to look at each pair of features and ask yourself, “could I combine this information in any way that might be even more useful?”


Example (real-estate)

We know that the quality and quantity of nearby schools affect housing prices. So how can we ensure our ML model picks up on this?

  • Let’s say we already have a feature in the dataset called ‘num_schools’, i.e. the number of schools within 5 miles of a property.
  • Let’s say we also have the feature ‘median_school’, i.e. the median quality score of those schools.

However, we might suspect that what’s really important is having many school options, but only if they are good.

  • Well, to capture that interaction, we could simply create a new feature: ‘school_score’ = ‘num_schools’ x ‘median_school’

This new ‘school_score’ feature would only have a high value (relatively) if both those conditions are met.
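To make this concrete, here’s a minimal pandas sketch using the ‘num_schools’ and ‘median_school’ columns described above (the values are made up):

```python
import pandas as pd

# Toy data for the two existing features
df = pd.DataFrame({
    'num_schools': [1, 8, 8],
    'median_school': [9.0, 3.5, 8.5],
})

# Interaction feature: only large when there are many schools AND they score well
df['school_score'] = df['num_schools'] * df['median_school']

print(df)
```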

Combine Sparse Classes

The next heuristic we’ll consider is grouping sparse classes. Sparse classes (in categorical features) are those that have very few total observations. They can be problematic for certain machine learning algorithms, causing models to overfit.

There’s no formal rule for how many observations each class needs. It also depends on the size of your dataset and the number of other features you have.

However, as a rule of thumb, we recommend combining classes until each one has at least ~50 observations. As with any rule of thumb, use this as a guideline (not as a hard rule).
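To see which classes are sparse in the first place, a simple count per class is usually all you need. Here’s a quick sketch, assuming a pandas DataFrame with a categorical exterior_walls column (the data below is made up):

```python
import pandas as pd

# Toy categorical feature with one deliberately sparse class
df = pd.DataFrame({
    'exterior_walls': ['Brick'] * 120 + ['Wood Siding'] * 30 + ['Asbestos shingle'] * 3
})

# Count observations per class; anything far below ~50 is a grouping candidate
counts = df['exterior_walls'].value_counts()
print(counts[counts < 50])
```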

Let’s take a look at the real-estate example:


To begin, we can group similar classes. In our real-estate dataset, the exterior_walls feature has several classes that are quite similar.

  • We might want to group 'Wood Siding', 'Wood Shingle', and 'Wood' into a single class. In fact, let’s just label all of them as 'Wood'.

Next, we can group the remaining sparse classes into a single ‘Other’ class, even if there’s already an ‘Other’ class.

  • We’d group 'Concrete Block', 'Stucco', 'Masonry', 'Other', and 'Asbestos shingle' into just 'Other'.

After combining similar and sparse classes, we’re left with fewer unique classes, but each one has more observations. Often, an eyeball test of the class distribution is enough to decide whether to group certain classes together.
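Here’s a minimal pandas sketch of that grouping step, using the class labels mentioned above (the exact class list in your own dataset may differ):

```python
import pandas as pd

# Toy column containing the original (ungrouped) classes
df = pd.DataFrame({'exterior_walls': [
    'Wood Siding', 'Wood Shingle', 'Wood', 'Brick', 'Stucco',
    'Masonry', 'Concrete Block', 'Other', 'Asbestos shingle',
]})

# Group similar classes under a single 'Wood' label
df['exterior_walls'] = df['exterior_walls'].replace(['Wood Siding', 'Wood Shingle'], 'Wood')

# Group the remaining sparse classes into 'Other'
df['exterior_walls'] = df['exterior_walls'].replace(
    ['Concrete Block', 'Stucco', 'Masonry', 'Asbestos shingle'], 'Other'
)

print(df['exterior_walls'].value_counts())
```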

Add Dummy Variables

Most machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values. Therefore, we need to create dummy variables for our categorical features.

Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature. The information you represent is exactly the same, but this numeric representation allows you to meet the technical requirements of most algorithms.

In the example above, after grouping sparse classes, we were left with 8 classes, which translate to 8 dummy variables.
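A minimal sketch with pandas’ get_dummies (the toy exterior_walls column is illustrative; in practice you’d pass the full dataset and list every categorical column):

```python
import pandas as pd

# Toy categorical feature after grouping sparse classes
df = pd.DataFrame({'exterior_walls': ['Wood', 'Brick', 'Other', 'Wood']})

# One binary (0/1) column per class of the categorical feature
dummies = pd.get_dummies(df, columns=['exterior_walls'], dtype=int)
print(dummies)
```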


Remove Unused Features

Finally, we should remove unused or redundant features from the dataset.

Unused features are those that don’t make sense to pass into our machine learning algorithms. Examples include:

  • ID columns
  • Features that wouldn’t be available at the time of prediction
  • Other text descriptions

Redundant features would typically be those that have been replaced by other features you added during feature engineering. For example, if you bin a numeric feature into a categorical one, you can often improve model performance by removing the “distracting” original feature.
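Here’s a small sketch of this final cleanup step. The column names (property_id, listing_text) are hypothetical stand-ins for an ID column and a free-text description:

```python
import pandas as pd

# Toy dataset containing both useful features and unused columns
df = pd.DataFrame({
    'property_id': [101, 102, 103],                  # ID column: unused
    'listing_text': ['Cozy...', 'Sunny...', '...'],  # free-text description: unused
    'num_schools': [1, 8, 8],
    'median_school': [9.0, 3.5, 8.5],
    'school_score': [9.0, 28.0, 68.0],               # engineered earlier
})

# Drop columns that shouldn't be passed to the algorithm
abt = df.drop(columns=['property_id', 'listing_text'])
print(abt.columns.tolist())
```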

Analytical Base Table (ABT)

After completing Data Cleaning and Feature Engineering, you’ll have transformed your raw dataset into an analytical base table (ABT). We call it an “ABT” because it’s what you’ll be building your models on.


As a final tip: Not all of the features you engineer need to be winners. In fact, you’ll often find that many of them don’t improve your model. That’s fine, because one highly predictive feature makes up for ten duds.

The key is choosing machine learning algorithms that can automatically select the best features among many options (built-in feature selection). This will allow you to avoid overfitting your model despite providing many input features. We’ll talk about this in the next core step of the Machine Learning Workflow: Algorithm Selection!

More About Feature Engineering

  • Best Practices for Feature Engineering
  • Python Data Wrangling Tutorial: Cryptocurrency Edition
  • Python for Data Science (Ultimate Quickstart Guide)

Read the rest of our Intro to Data Science here.
