Feature Engineering for the Machine Learning Models

Updated: at 08:12 AM


Feature Engineering

When we talk about machine learning, we often focus on the algorithms and models. But the data itself matters at least as much: the quality of the data largely determines the quality of the model. The data should be clean and relevant, and it should have the right features.

The process of selecting the right features and transforming the data into a format that is suitable for the model is called feature engineering.


The first step in feature engineering is pre-processing. This involves cleaning the data, handling missing values, and encoding categorical variables.

Correlation Analysis

Correlation analysis is used to identify relationships between features. Highly correlated features carry largely redundant information, so it is often worth removing one feature from each such pair: the redundancy adds dimensionality without adding signal, can make the coefficients of linear models unstable, and can contribute to overfitting. For example, height and weight are correlated: as height increases, weight tends to increase too. If we observe an individual who is unusually tall, we can reasonably expect their weight to be above average as well.
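As a rough sketch of how correlated features can be spotted and pruned, the snippet below builds a correlation matrix with pandas and drops one feature from each highly correlated pair. The data, the column names, and the 0.9 threshold are all made up for illustration; a suitable cutoff depends on your dataset.

```python
import pandas as pd

# Hypothetical data: weight_kg tracks height_cm closely, age does not.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190, 165],
    "weight_kg": [52, 58, 67, 75, 84, 62],
    "age": [41, 23, 35, 29, 52, 19],
})

threshold = 0.9  # assumed cutoff, tune it for your own data
corr = df.corr().abs()  # absolute Pearson correlation between every pair

# Collect each feature pair once by walking the upper triangle only.
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)

# Drop the second feature of every highly correlated pair.
to_drop = {second for _, second in high_pairs}
reduced = df.drop(columns=sorted(to_drop))
print(list(reduced.columns))
```

Which feature of a pair to keep is a judgment call; here the second one is dropped simply to keep the sketch short.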

Correlation Coefficient

The correlation coefficient is a statistical measure of the strength and direction of the linear relationship between two variables. Its values range from -1.0 to 1.0. A value of 1.0 indicates a perfect positive relationship, while -1.0 indicates a perfect negative relationship. A value of 0.0 indicates no linear relationship, although the variables may still be related in a non-linear way.
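To show where the number comes from, here is a minimal hand-rolled version of the Pearson correlation coefficient. The function name `pearson_r` and the sample values are illustrative assumptions; in practice you would more likely call a library routine.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variance terms; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: heights in cm and weights in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 75, 84]

r = pearson_r(heights, weights)
print(f"height vs weight: r = {r:.3f}")

# A perfectly inverse linear relationship sits at the other end of the range.
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(f"inverse relationship: r = {r_negative:.1f}")
```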


Feature Selection

Feature selection is the process of selecting the most important features from the dataset. This is done to reduce the dimensionality of the dataset and to improve the performance of the model. Several techniques exist for this, ranging from simple statistical filters that score each feature against the target to methods that rely on the model itself.
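One possible way to do this is a univariate filter. The sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score to keep the two features most related to the target. The synthetic data and feature names are assumptions made up for the example, not part of any real dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only the first two columns influence the target.
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)
feature_names = ["informative_a", "informative_b", "noise"]

# Score every feature against the target and keep the best two.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
print(selected)
```

A filter like this scores features one at a time, so it can miss features that are only useful in combination.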


Normalization is the process of rescaling the features to a common range, typically between 0 and 1 (min-max scaling). This is done to ensure that the features are on the same scale and to improve the performance of the model.

Normalizing numeric variables can help the learning process when their ranges differ greatly, because the variables with the largest magnitudes could dominate the ML model whether or not they are informative about the target.

Example: Consider a dataset with a feature called age that ranges from 18 to 35 and a product price that ranges from $50 to $5,000. Because the product price takes much larger values than the age, a model that is sensitive to scale may treat the product price as "more important". This can hurt the model's ability to classify data correctly, which shows up as lower precision and accuracy scores.
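To make the age and price example concrete, here is a minimal sketch of min-max scaling, one common way to bring features into the range 0 to 1. The four rows of data are invented for illustration.

```python
import numpy as np

# Hypothetical rows of (age, product price in dollars).
data = np.array([
    [18.0, 50.0],
    [25.0, 1200.0],
    [30.0, 3000.0],
    [35.0, 5000.0],
])

# Min-max scaling: subtract each column's minimum, divide by its range.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, both columns span 0 to 1, so neither dominates purely because of its units.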

Standardization is used to center the data by removing the mean (mean becomes 0) and scaling to unit variance (standard deviation becomes 1). This is done to ensure that the features are on the same scale and to improve the performance of the model.
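A short sketch of standardization using scikit-learn's `StandardScaler` is shown below. The age and price rows are made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of (age, product price in dollars).
data = np.array([
    [18.0, 50.0],
    [25.0, 1200.0],
    [30.0, 3000.0],
    [35.0, 5000.0],
])

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
standardized = scaler.fit_transform(data)

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```

In a real project the scaler would normally be fitted on the training set only and then applied to the test set, so that no information leaks from the test data.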

One-Hot Encoding

One-hot encoding is used for categorical values: each category of a feature is converted to its own binary (0 or 1) column.
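As an illustration, the sketch below one-hot encodes a small categorical column with `pandas.get_dummies`. The column name mirrors the housing example later in this post, but the rows are invented.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values.
df = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# Each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["furnishingstatus"], dtype=int)
print(encoded)
```

Every row has exactly one 1 across the new columns, which marks the category that row belonged to.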



Binning is the process of converting continuous variables into categorical variables. This is done to reduce the complexity of the model and to improve the performance of the model.

Types of Binning

  1. Categorical Binning: The categorical binning processor takes two inputs, a numerical variable and a parameter called bin number, and outputs a categorical variable. The purpose is to discover non-linearity in the variable’s distribution by grouping observed values together.
  2. Numerical Binning: The numerical binning processor takes two inputs, a numerical variable and a parameter called bin number, and outputs a numerical variable. The purpose is to discover non-linearity in the variable’s distribution by grouping observed values together.
  3. Quantile Binning: The quantile binning processor takes two inputs, a numerical variable and a parameter called bin number, and outputs a categorical variable. The purpose is to discover non-linearity in the variable's distribution by grouping observed values together. Because quantile binning creates bins that each hold roughly the same number of observations, it is a good choice when you want a small number of evenly populated age groups. For example, you could create bins such as Under 30, 30 to 50, and Over 50, or label them by generation: Millennial, Generation X, Baby Boomer, and so on.
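The difference between hand-picked bin edges and quantile bins can be sketched with pandas, using `pd.cut` for fixed edges and `pd.qcut` for quantiles. The ages below are made up, and the edges assume whole-number ages.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([19, 23, 27, 31, 38, 42, 47, 55, 63, 71])

# Fixed edges: intervals are (0, 29], (29, 50], (50, 120] for integer ages.
fixed = pd.cut(
    ages,
    bins=[0, 29, 50, 120],
    labels=["Under 30", "30 to 50", "Over 50"],
)

# Quantile binning: edges are chosen so each bin gets about the same count.
quantile = pd.qcut(ages, q=5, labels=False)

print(fixed.tolist())
print(quantile.tolist())
```

With fixed edges the bin sizes depend on how the ages happen to be distributed, while the quantile version puts two of the ten ages into each of its five bins.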


Regularization is the process of adding a penalty term to the loss function to prevent overfitting. There are two types of regularization:

L1 (Lasso) Regularization

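This is also known as Lasso regularization. It adds the sum of the absolute values of the coefficients, scaled by a penalty strength, to the loss function. This tends to produce sparse coefficients, which means that some coefficients are set exactly to zero and the corresponding features are effectively dropped.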

L2 (Ridge) Regularization

This is also known as Ridge regularization. It adds the sum of the squares of the coefficients, scaled by a penalty strength, to the loss function. This results in small coefficients: they are shrunk toward zero but usually do not become exactly zero, so all features stay in the model.

| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Features | If the goal is to use few features | If you want to consider all features |
| Purpose | Dimensionality reduction | |
| Efficiency | | Computationally efficient |
| When | If a feature has outliers; overfitting | Overfitting |
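The contrast between the two penalties can be seen in a small experiment with scikit-learn's `Lasso` and `Ridge`. This is only a sketch: the data is synthetic, and the `alpha` values (the penalty strength) are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)  # L2 penalty

# Count coefficients that each model set exactly to zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros:", n_zero_lasso, "vs", n_zero_ridge)
```

Lasso zeroes out the uninformative features, which is why it doubles as a feature selection tool, while Ridge keeps every feature with a small weight.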

Non-Adequate Features

Imputation Techniques

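Imputation is the process of replacing missing values with substitute values. There are several techniques for imputation; common ones replace each missing value with the mean, the median, or the most frequent value of its column.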
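Here is a minimal sketch of two such strategies using scikit-learn's `SimpleImputer`. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "area": [1000.0, np.nan, 1500.0, 2000.0],
    "bedrooms": [2.0, 3.0, np.nan, 3.0],
})

# Mean imputation for a continuous column.
mean_imputer = SimpleImputer(strategy="mean")
area_filled = mean_imputer.fit_transform(df[["area"]]).ravel()

# Most-frequent (mode) imputation for a count-like column.
mode_imputer = SimpleImputer(strategy="most_frequent")
bedrooms_filled = mode_imputer.fit_transform(df[["bedrooms"]]).ravel()

print(area_filled.tolist())
print(bedrooms_filled.tolist())
```

The missing area becomes the mean of the observed areas, and the missing bedroom count becomes the most common value in that column.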


Sample Data




import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = pd.read_csv('housing.csv')

# Feature engineering: convert categorical columns to numeric using one-hot encoding
categorical_features = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea', 'furnishingstatus']
numeric_features = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']

# ColumnTransformer lets us handle numerical and categorical columns in a single step.
preprocessor = ColumnTransformer(transformers=[
    ('num', 'passthrough', numeric_features),        # keep numeric columns unchanged
    ('cat', OneHotEncoder(), categorical_features),  # one-hot encode categorical columns
])

model = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', LinearRegression())]) # Define the model

# Split data into training and testing sets
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the model (train it on the training set)
model.fit(X_train, y_train)


import pandas as pd

new_data = pd.DataFrame({
    'area': [1000],
    'bedrooms': [2],
    'bathrooms': [4],
    'stories': [3],
    'mainroad': ['yes'],
    'guestroom': ['no'],
    'basement': ['yes'],
    'hotwaterheating': ['no'],
    'airconditioning': ['yes'],
    'parking': [2],
    'prefarea': ['no'],
    'furnishingstatus': ['semi-furnished']
})

predicted_price = model.predict(new_data)
formatted_price = f"{predicted_price[0]:,.0f}"

print(f"The predicted price of the house is: ${formatted_price}")
# The predicted price of the house is: $7,799,589


Feature engineering is the most important part of machine learning. However, it is often overlooked. It is important to spend time on feature engineering to ensure that the data is clean, relevant, and has the right features. This will help to improve the performance of the model and to make better predictions.

