Effect of Transforming the Targets in Regression Model

Importing required modules

First, we import the required Python modules: pandas, scikit-learn, Matplotlib, and Seaborn.


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import QuantileTransformer
import seaborn as sns

The rest of this walkthrough fits a linear regression model with pandas and scikit-learn. After the imports, the dataset is read into a pandas DataFrame and split into training and testing sets. QuantileTransformer is then applied to the target variable (not to the features), a linear regression model is trained on both the raw and the transformed targets, and each model is scored with mean absolute error and R-squared. Finally, Matplotlib and Seaborn are used to plot the target distributions and the relationship between predicted and actual values.

Dataset loading and splitting


# Load House Prices dataset
data = pd.read_csv('train.csv')
# Select features and target
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
# Handle missing values and categorical features (customize as needed)
X = X.fillna(0)  # Replace NaN values with 0 for simplicity
X = pd.get_dummies(X)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

The code above loads the House Prices dataset from Kaggle, separates the target variable (y, the SalePrice column) from the features (X), and handles missing values by simply replacing NaNs with 0. The categorical features are then one-hot encoded with the pandas get_dummies function, and the data is split into training and testing sets in an 80:20 ratio.
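The preprocessing steps can be tried without downloading train.csv. The following is a minimal sketch on a made-up ten-row frame; the column values and the names toy, X_toy and y_toy are illustrative assumptions, not part of the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train.csv (all values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave",
               "Grvl", "Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000],
})

# Separate features and target, as in the article
X_toy = toy.drop("SalePrice", axis=1)
y_toy = toy["SalePrice"]

# Replace the missing LotArea with 0, then one-hot encode the Street column
X_toy = X_toy.fillna(0)
X_toy = pd.get_dummies(X_toy)

# 80:20 split: 8 training rows and 2 test rows
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print(list(X_toy.columns))
print(X_toy_train.shape, X_toy_test.shape)
```

Filling with 0 is a deliberately crude choice. For a real model, a per-column strategy such as the median for numeric columns is usually worth considering.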

Quantile Transformation


# Linear Regression on Original Targets
model_original = LinearRegression()
model_original.fit(X_train, y_train)
y_pred_original = model_original.predict(X_test)
# QuantileTransformer
quantile_transformer = QuantileTransformer(output_distribution='normal', random_state=42)
y_train_transformed_quantile = quantile_transformer.fit_transform(y_train.values.reshape(-1, 1)).flatten()
model_quantile = LinearRegression()
model_quantile.fit(X_train, y_train_transformed_quantile)
y_pred_quantile = model_quantile.predict(X_test)
y_pred_quantile_inverse = quantile_transformer.inverse_transform(y_pred_quantile.reshape(-1, 1)).flatten()

Here we use QuantileTransformer because the target is a sale price, whose distribution is right-skewed and contains outliers whose influence we want to reduce. The model is LinearRegression in both cases: one instance is fitted on the raw targets and another on the transformed targets. For the transformer, the random_state parameter fixes the random subsampling used when estimating the quantiles, and output_distribution='normal' maps the training targets to an approximately normal distribution. Note that the transformer is fitted on y_train only, and that predictions made on the transformed scale are mapped back to the original price scale with inverse_transform before they are compared with y_test.
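The manual fit, transform, predict and inverse-transform sequence can also be delegated to scikit-learn's TransformedTargetRegressor, which wraps a regressor and a target transformer in one estimator. This is a sketch of an alternative, not the approach used in this article. It runs on synthetic data, so the names X_demo, y_demo and wrapped, and the choice of n_quantiles=100, are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# Synthetic data (an assumption): three features and a skewed, positive target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.exp(X_demo @ np.array([0.5, -0.3, 0.2])
                + rng.normal(scale=0.1, size=500))

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# The wrapper fits the transformer on y, trains the regressor on the
# transformed y, and applies inverse_transform inside predict()
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=42),
    check_inverse=False,  # skip the round-trip check; the quantile mapping is only approximately invertible
)
wrapped.fit(X_demo_train, y_demo_train)

# Predictions are already on the original target scale
y_demo_pred = wrapped.predict(X_demo_test)
print(y_demo_pred[:5])
```

One consequence of the quantile mapping is that the inverse transform clips to the range of targets seen during fitting, so predictions cannot fall outside the minimum and maximum of the training targets.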

Comparative visualization of target variable

Here we plot the distribution of the raw target variable next to its distribution after the quantile transformation.


# Plot the distribution of the original target
plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
sns.histplot(y, bins=30, kde=True, color='green')
plt.title('Distribution of Original Target')
# Plot the distribution after transforming the target using QuantileTransformer
plt.subplot(1, 2, 2)
sns.histplot(y_train_transformed_quantile, bins=30, kde=True, color='green')
plt.title('Distribution after Quantile Transformation')
plt.xlabel('Transformed SalePrice')
plt.tight_layout()
plt.show()


Comparative plot for raw target distribution and quantile transformed distribution

This code generates a side-by-side figure that shows the distribution of the target variable “SalePrice” before and after the quantile transformation. The left subplot shows the original target as a histogram with a kernel density estimate. The right subplot shows the training targets after QuantileTransformer has been applied. Because the transformed target is approximately normal rather than right-skewed, the transformation may help models such as linear regression, which tend to fit better when the target and the residuals are roughly symmetric. Matplotlib and Seaborn are used for the plots.
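The change in shape can also be checked numerically, for example through the sample skewness, which is close to 0 for a symmetric distribution. The sketch below uses a synthetic log-normal series as a stand-in for SalePrice; the names prices and prices_t and the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed "prices" (an assumption, not the Kaggle data)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=1000))

# Map the values to an approximately normal distribution
qt = QuantileTransformer(
    n_quantiles=1000, output_distribution="normal", random_state=42)
prices_t = pd.Series(
    qt.fit_transform(prices.to_numpy().reshape(-1, 1)).ravel())

# Sample skewness before and after the transformation
skew_before = prices.skew()
skew_after = prices_t.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

On data like this, the skewness should be clearly positive before the transformation and close to zero afterwards.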

Visualizing prediction and actual values

In this comparative plot, we visualize actual versus predicted values for linear regression trained on the raw targets and on the quantile-transformed targets.


# Plotting
plt.figure(figsize=(12, 6))
# Original Predictions vs Actual
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_original)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2)
plt.title('Linear Regression on Original Targets')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
# QuantileTransformer Predictions vs Actual
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_quantile_inverse)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2)
plt.title('Linear Regression on Quantile Transformed Targets')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.tight_layout()
plt.show()


Comparative plot for actual vs. predicted values for raw and quantile targets

This code draws two scatter plots of predicted against actual values, side by side. The left subplot shows the predictions of the model trained on the original targets. The right subplot shows the same comparison for the model trained on the quantile-transformed targets, with its predictions mapped back to the original scale. In both plots the red dashed line marks perfect predictions, so the closer the points lie to it, the better the model. Comparing the two panels shows how transforming the target changes the model's predictions.

Performance evaluation

Now we evaluate both models in terms of MAE and R2-score.


# evaluation for model with raw targets
mae_original = mean_absolute_error(y_test, y_pred_original)
r2_original = r2_score(y_test, y_pred_original)
print(f'Mean Absolute Error (Original): {mae_original:.2f}')
print(f'R2-Score (Original): {r2_original:.2f}')
# evaluation for model with transformed target
# (predictions are mapped back to the original scale before scoring)
y_pred_quantile = model_quantile.predict(X_test)
y_pred_quantile_inverse = quantile_transformer.inverse_transform(y_pred_quantile.reshape(-1, 1)).flatten()
mae_quantile = mean_absolute_error(y_test, y_pred_quantile_inverse)
r2_quantile = r2_score(y_test, y_pred_quantile_inverse)
print(f'Mean Absolute Error (Quantile Transformer): {mae_quantile:.2f}')
print(f'R2-Score (Quantile Transformer): {r2_quantile:.2f}')


Mean Absolute Error (Original): 21131.84
R2-Score (Original): 0.44
Mean Absolute Error (Quantile Transformer): 15339.08
R2-Score (Quantile Transformer): 0.93

From these results, we can see the positive effect of transforming the target in this regression problem. With the raw targets the model reaches an R2-score of only 0.44, while after the transformation it reaches 0.93. The MAE also drops from 21131.84 to 15339.08, a reduction of roughly 27%. Both sets of metrics are computed on the original SalePrice scale, so they are directly comparable.
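To make the two metrics concrete, here is a small worked example on three made-up values (y_true and y_hat are illustrative, not taken from the dataset). MAE is the average absolute error in the units of the target, and the R2-score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted values for illustration only
y_true = [100.0, 200.0, 300.0]
y_hat = [110.0, 190.0, 300.0]

# MAE: (|100-110| + |200-190| + |300-300|) / 3 = 20 / 3
mae = mean_absolute_error(y_true, y_hat)

# R2: residual sum of squares = 100 + 100 + 0 = 200
#     total sum of squares around the mean (200) = 10000 + 0 + 10000 = 20000
#     R2 = 1 - 200 / 20000 = 0.99
r2 = r2_score(y_true, y_hat)

print(f"MAE: {mae:.2f}")  # MAE: 6.67
print(f"R2:  {r2:.2f}")   # R2:  0.99
```

Because MAE is expressed in the target's units, it is only meaningful for the transformed model once the predictions have been passed through inverse_transform, as the evaluation code does.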


In conclusion, the distribution plots and prediction plots demonstrate the effect of transforming the target in a regression model with scikit-learn. Methods such as quantile transformation can improve model performance by reducing the impact of skewed target distributions. The scatter plots illustrate how the target transformation changes the linear regression predictions, which highlights the importance of careful preprocessing for accurate and trustworthy regression modeling.
