Feature Engineering

  • Why should we use Feature Engineering in data science
  • Feature Selection
  • Handling missing values
  • Handling imbalanced data
  • Handling outliers, Binning
  • Encoding, Feature Scaling

Feature engineering is a topic every machine learning enthusiast has heard of. But the concept keeps eluding most people.

How in the world can you use feature engineering?

Why do we need to engineer features at all?

We know that machine learning algorithms use some input data to produce results. But quite often, the data you’ve been given might not be enough for designing a good machine learning model. That’s where the power of feature engineering comes into play.

Feature engineering has two primary goals:

  • Preparing the proper input dataset, compatible with the machine learning algorithm requirements
  • Improving the performance of machine learning models

In this article, we’ll quickly go through 7 common feature engineering techniques that every machine learning professional should know.

List of Feature Engineering Techniques

  • Imputation
  • Handling Outliers
  • Binning
  • Log Transform
  • One-Hot Encoding
  • Grouping Operations
  • Scaling

1. Imputation

Missing values are one of the most common problems you can encounter when you prepare your data for machine learning. The reason for the missing values might be human errors, interruptions in the data flow, privacy concerns, etc. Whatever the reason, missing values affect the performance of machine learning models.

Some of the imputation operations you can perform are:

  • Numerical Imputation: Imputation is preferable to dropping rows because it preserves the data size. The important choice is which value to impute. I suggest first considering whether the column has a sensible default for missing values (for example, 0 for a count); if not, the column median is a common choice because it is robust to outliers
  • Categorical Imputation: Replacing the missing values with the most frequent value in the column (the mode) is a good option for handling categorical columns
  • Random Sample Imputation: This consists of drawing random observations from the non-missing values of the column and using them to replace the NaN values
  • End of Distribution Imputation: This replaces the missing values with a value from the far tail of the column's distribution (for example, mean plus 3 standard deviations), so that imputed entries stand out from the observed ones
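
The imputation options listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `impute` helper, its strategy names, the sample columns, and the "mean plus 3 standard deviations" rule for end of distribution imputation are all assumptions made for the example.

```python
import random
import statistics

def impute(values, strategy, seed=0):
    """Return a copy of `values` with None entries filled in.

    strategy: "median", "mode", "random", or "end_of_distribution"
    (hypothetical names chosen for this sketch).
    """
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        # Most frequent value; works for categorical columns too.
        fill = statistics.mode(observed)
    elif strategy == "end_of_distribution":
        # Far right tail: mean + 3 standard deviations (assumed convention).
        fill = statistics.mean(observed) + 3 * statistics.stdev(observed)
    elif strategy == "random":
        # Each missing entry gets its own random draw from the observed values.
        rng = random.Random(seed)
        return [rng.choice(observed) if v is None else v for v in values]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [22, None, 35, 41, None, 29]
cities = ["Rome", "Paris", None, "Rome"]

print(impute(ages, "median"))   # [22, 32.0, 35, 41, 32.0, 29]
print(impute(cities, "mode"))   # ['Rome', 'Paris', 'Rome', 'Rome']
```

Note that every strategy keeps the column length unchanged, which is the main advantage of imputing over dropping rows.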

2. Handling Outliers

Before mentioning how outliers can be handled, I want to state that the best way to detect outliers is to visualize the data. Statistical rules are all prone to mistakes, whereas seeing the outliers in a plot lets you make the decision with high precision.

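  • Outliers in terms of standard deviation: if a value lies farther from the mean than x * standard deviation, it can be treated as an outlier. There is no single correct x, but values between 2 and 4 are commonly used
  • Outliers in terms of percentiles: a certain percentage of the values at the top or the bottom of the column can be treated as outliers

A common mistake is to compute the percentiles according to the range of the data. In other words, if your data ranges from 0 to 100, your top 5% is not the values between 96 and 100. Top 5% here means the values that lie above the 95th percentile of the data.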
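
Both detection rules can be sketched with the standard library. The function names, the default multiplier `x=3.0`, the 5th/95th percentile cut-offs, and the sample data are assumptions for illustration; capping is shown as one possible way to handle the detected values.

```python
import statistics

def std_outliers(values, x=3.0):
    """Values farther than x standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > x * std]

def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Cut points computed from the data itself, not from its min-max range."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def cap(values, low, high):
    """Clip every value into [low, high] instead of dropping rows."""
    return [min(max(v, low), high) for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 100]

print(std_outliers(data))          # [100]
low, high = percentile_bounds(data)
capped = cap(data, low, high)      # 100 is pulled down to the 95th percentile
```

Because the 95th percentile is taken from the sorted observations, it lands just above the bulk of the data rather than at 95% of the 10 to 100 range.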

3. Binning

Binning can be applied to both categorical and numerical data. For example, numerical values can be grouped into ranges such as low, mid, and high, and countries can be grouped into continents.

The main motivation for binning is to make the model more robust and prevent overfitting. However, it comes at a cost to performance: every time you bin something, you sacrifice information and make your data more regularized.
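
As a rough sketch of binning for both data types, the helpers below map numbers to labelled ranges and categories to coarser groups. The bin edges, labels, and the country-to-continent mapping are made-up values for illustration.

```python
def bin_numeric(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    `labels` has one more entry than `edges`; the last label catches
    everything above the final edge.
    """
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    return labels[-1]

def bin_category(value, mapping, default="Other"):
    """Map a fine-grained category to a coarser group."""
    return mapping.get(value, default)

EDGES = [30, 70]                      # hypothetical cut points
LABELS = ["low", "mid", "high"]
COUNTRY_TO_CONTINENT = {              # hypothetical mapping
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

print([bin_numeric(v, EDGES, LABELS) for v in (2, 45, 85)])
# ['low', 'mid', 'high']
print(bin_category("Italy", COUNTRY_TO_CONTINENT))   # Europe
```

The information loss is visible here: after binning, 31 and 69 are indistinguishable, which is exactly the trade-off described above.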

4. Log Transform

Logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. Here are its benefits, along with one caveat:

  • It helps to handle skewed data: after the transformation, the distribution becomes closer to normal
  • It also decreases the effect of outliers by compressing differences in magnitude, so the model becomes more robust
  • Caveat: the data you apply log transform to must contain only positive values, because the logarithm is undefined for zero and negative numbers. A common workaround for data containing zeros is to add 1 before transforming, i.e. log(x + 1)
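
A minimal sketch of the transform, assuming the common log(x + 1) variant so that zeros are allowed. The `log_transform` helper and the sample values are illustrative only.

```python
import math

def log_transform(values):
    """Apply log(x + 1) to non-negative data."""
    if any(v < 0 for v in values):
        raise ValueError("log transform expects non-negative values")
    return [math.log1p(v) for v in values]

incomes = [0, 9, 99, 999, 9999]       # heavily right-skewed
logged = log_transform(incomes)

# Gaps that grow tenfold in the raw data become roughly constant steps.
steps = [b - a for a, b in zip(logged, logged[1:])]

# The plain logarithm rejects non-positive input outright.
try:
    math.log(-1)
except ValueError as err:
    print("math.log(-1) failed:", err)
```

Each step in `steps` is close to log(10), which shows how the transform normalizes magnitude differences.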

5. One-Hot Encoding

One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns, one per distinct value, and assigns 0 or 1 to each of them. These binary values express the relationship between each row and the encoded column.

This method changes your categorical data, which is challenging for algorithms to understand, into a numerical format and enables you to group your categorical data without losing any information.
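
To make the flag columns concrete, here is a small sketch that one-hot encodes a column in a list of dictionaries. The `one_hot` helper, the `column_value` naming scheme, and the sample rows are assumptions for illustration.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 flag column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column untouched.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [
    {"user": 1, "city": "Rome"},
    {"user": 2, "city": "Madrid"},
    {"user": 3, "city": "Rome"},
]
encoded = one_hot(rows, "city")
print(encoded[0])   # {'user': 1, 'city_Madrid': 0, 'city_Rome': 1}
```

Exactly one flag is set per row, so the original category can always be recovered from the flags.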

6. Grouping Operations

Grouping operations summarize the rows that belong to the same instance, using a pivot table or a group-by with aggregate functions (which can be written as lambda functions).

In most cases, numerical columns are aggregated using sum and mean functions.
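
A group-by with aggregate functions can be sketched without any library, as below. The `group_aggregate` helper, the column names, and the sample visits are hypothetical; the "range" aggregate shows where a lambda fits in.

```python
from collections import defaultdict
from statistics import mean

def group_aggregate(rows, key, column, aggregations):
    """Group `rows` by `key` and apply each named aggregate to `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        group: {name: fn(vals) for name, fn in aggregations.items()}
        for group, vals in groups.items()
    }

visits = [
    {"user": "a", "spend": 10.0},
    {"user": "b", "spend": 4.0},
    {"user": "a", "spend": 30.0},
    {"user": "b", "spend": 6.0},
]

summary = group_aggregate(
    visits, "user", "spend",
    {"sum": sum, "mean": mean, "range": lambda v: max(v) - min(v)},
)
print(summary["a"])   # {'sum': 40.0, 'mean': 20.0, 'range': 20.0}
```

The result has one entry per user, with each aggregate available as a new feature.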

7. Scaling

In most cases, the numerical features of a dataset do not share a common range and differ from each other in magnitude. Scaling is required to bring them onto comparable ranges.

  • Normalization

Normalization (or min-max normalization) scales all values into a fixed range between 0 and 1: the column minimum is subtracted from each value, and the result is divided by the column range (maximum minus minimum). This transformation does not change the shape of the feature's distribution, and because the standard deviation shrinks, the effect of outliers increases. Therefore, it is recommended to handle outliers before normalization.
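
A minimal sketch of min-max normalization follows. The `min_max_scale` helper and the sample values are illustrative, and returning zeros for a constant column is an assumption made to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

heights = [150, 160, 170, 180, 200]
print(min_max_scale(heights))   # [0.0, 0.2, 0.4, 0.6, 1.0]

# A single outlier squeezes all the regular values toward 0.
with_outlier = min_max_scale([1, 2, 3, 4, 100])
```

In `with_outlier`, the first four values all end up below 0.05, which is why outliers are best handled before normalizing.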

  • Standardization

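Standardization (or z-score normalization) scales the values while taking the standard deviation into account: the column mean is subtracted from each value, and the result is divided by the column standard deviation, so the scaled feature has mean 0 and standard deviation 1. Features whose standard deviations differ end up with different ranges, which reduces the effect of outliers in the features.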
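
Standardization can be sketched the same way. The `standardize` helper and the sample data are illustrative; using the population standard deviation (`pstdev`) rather than the sample one is an assumption of this sketch.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Constant column: every z-score is 0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, standard deviation 2
z = standardize(scores)
print(z)   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max normalization, the output is not confined to a fixed range: each value is expressed as a number of standard deviations from the mean.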

Standardization is mostly used with distance-based algorithms (KNN, etc.) and with models trained by gradient descent (linear regression, ANNs), where it helps convergence happen faster. Normalization is mostly used where the inputs have a known, bounded range, such as scaling down pixel values for a CNN.

End Notes

This was a quick overview of the different feature engineering techniques at our disposal. It is in no way an exhaustive list, but it is good enough to get you started.