Effect of Transforming the Targets in Regression Model

Importing required modules

At first, we will import all required Python modules like Pandas, Seaborn, Matplotlib and SKlearn etc.

Python3

import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score, mean_absolute_error 
import matplotlib.pyplot as plt 
from sklearn.preprocessing import QuantileTransformer 
import seaborn as sns 

This snippet of code analyzes a dataset using linear regression using the pandas and scikit-learn libraries in Python. The required modules are imported first, and the dataset is then read into a pandas DataFrame. The data is divided into training and testing sets, and then QuantileTransformer is used to scale the features. Using the transformed data, a linear regression model is trained, and evaluation metrics like mean absolute error and R-squared are computed. Using matplotlib and seaborn, the code ends with a visual representation of the relationship between the predicted and actual values.

Dataset loading and splitting

Python3

# Load House Prices dataset 
data = pd.read_csv('train.csv') 
  
# Select features and target 
X = data.drop('SalePrice', axis=1) 
y = data['SalePrice'] 
  
# Handle missing values and categorical features (customize as needed) 
X = X.fillna(0)  # Replace NaN values with 0 for simplicity 
X = pd.get_dummies(X) 
  
# Split the dataset 
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, test_size=0.2, random_state=42) 

Now we will load a Kaggle dataset and after handling missing values we will split the dataset into training and testing sets(80:20). This code selects the target variable (y) and features (X) from the House Prices dataset that is loaded from Kaggle. NaNs are simply replaced with 0 to handle missing values. The get_dummies function in pandas is then used to one-hot encode the categorical features.

Quantile Transformation

Python3

# Linear Regression on Original Targets 
model_original = LinearRegression() 
model_original.fit(X_train, y_train) 
y_pred_original = model_original.predict(X_test) 
# QuantileTransformer 
quantile_transformer = QuantileTransformer(output_distribution='normal', random_state=42) 
y_train_transformed_quantile = quantile_transformer.fit_transform(y_train.values.reshape(-1, 1)).flatten() 
model_quantile = LinearRegression() 
model_quantile.fit(X_train, y_train_transformed_quantile) 
y_pred_quantile = model_quantile.predict(X_test) 
y_pred_quantile_inverse = quantile_transformer.inverse_transform(y_pred_quantile.reshape(-1, 1)).flatten()

Here we will use Quantile transformer as we have used a dataset of sales price in which we need to bypass the effect of outliers. And for model, we will use Liner Regression model. To implement quantile transformer we have handled the randomness of transformation subsets using its ‘random_state’ parameter and the distribution of output is set to ‘normal’ by ‘output_distribution’ parameter which means that after transformation the target will follow a normal distribution.

Comparative visualization of target variable

Here we will plot the distribution of target variable for raw dataset and the distribution of target variable after Quantile transformation.

Python3

# Plot the distribution of the original target 
plt.figure(figsize=(8, 4)) 
plt.subplot(1, 2, 1) 
sns.histplot(y, bins=30, kde=True, color='green') 
plt.title('Distribution of Original Target') 
plt.xlabel('SalePrice') 
plt.ylabel('Frequency') 
  
# Plot the distribution after transforming the target using QuantileTransformer 
plt.subplot(1, 2, 2) 
sns.histplot(y_train_transformed_quantile, bins=30, kde=True, color='green') 
plt.title('Distribution after Quantile Transformation') 
plt.xlabel('Transformed SalePrice') 
plt.ylabel('Frequency') 
plt.tight_layout() 
plt.show()

Output:

Comparative plot for raw target distribution and quantile transformed distribution

This code generates a comparison table that shows the distribution of the target variable “SalePrice” at initial values and after quantile transformation. The original target distribution is represented visually in the left subplot by a histogram with kernel density estimation. The transformed distribution is shown in the right subplot following the application of QuantileTransformer to the training set. By ensuring a more uniform distribution, the quantile transformation may help regression models that assume a normal distribution of the target variable perform better. Matplotlib and Seaborn are utilized for the visual aids.

Visualizing prediction and actual values

In this comparative plot, we will visualize the actual and predicted values for both Linear regression on raw targets and quantile transformed target.

Python3

# Plotting 
plt.figure(figsize=(12, 6)) 
  
# Original Predictions vs Actual 
plt.subplot(1, 2, 1) 
plt.scatter(y_test, y_pred_original) 
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) 
plt.title('Linear Regression on Original Targets') 
plt.xlabel('Actual Values') 
plt.ylabel('Predicted Values') 
plt.grid(True) 
  
# QuantileTransformer Predictions vs Actual 
plt.subplot(1, 2, 2) 
plt.scatter(y_test, y_pred_quantile_inverse) 
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) 
plt.title('Linear Regression on Quantile Transformed Targets') 
plt.xlabel('Actual Values') 
plt.ylabel('Predicted Values') 
plt.grid(True) 
plt.tight_layout() 
plt.show()

Output:

Comparative plot for actual vs. predicted values for raw and quantile targets

Using two scatter plots for linear regression predictions, this code generates a side-by-side comparison. Predictions and actual values using the original target variable are shown in the left subplot. For a perfect prediction line, use the red dashed line. Although the actual values and predictions are based on the targets that were transformed using the QuantileTransformer, the same comparison is depicted in the right subplot. A comparison of the impact of the target variable transformation on the model’s predictions can be made thanks to this visualization. For perfect predictions in both plots, use the red dashed line as a reference.

Performance evaluation

Now we will evaluate both the models in the terms of MAE and R2-Score.

Python3

# evaluation for model with raw targets 
mae_original = mean_absolute_error(y_test, y_pred_original) 
r2_original = r2_score(y_test, y_pred_original) 
print(f'Mean Absolute Error (Original): {mae_original:.2f}') 
print(f'R2-Score (Original): {r2_original:.2f}') 
# evaluation for model with tranformerd target 
y_pred_quantile = model_quantile.predict(X_test) 
y_pred_quantile_inverse = quantile_transformer.inverse_transform(y_pred_quantile.reshape(-1, 1)).flatten() 
mae_quantile = mean_absolute_error(y_test, y_pred_quantile_inverse) 
r2_quantile = r2_score(y_test, y_pred_quantile_inverse) 
print(f'Mean Absolute Error (Quantile Transformer): {mae_quantile:.2f}') 
print(f'R2-Score (Quantile Transformer): {r2_quantile:.2f}')

Output:

Mean Absolute Error (Original): 21131.84
R2-Score (Original): 0.44
Mean Absolute Error (Quantile Transformer): 15339.08
R2-Score (Quantile Transformer): 0.93

From this results, we can clearly see the positive effect of transforming the targets in regression problems. With raw target values the R2-score is moderately well with 44% and after transformation it achieves a notable R2-score of 93%. Also the MAE is greatly reduced after transformation.

Conclusion

In conclusion, the enhanced distributions and prediction visualizations demonstrate the effect of modifying targets in a regression model with Scikit-Learn. By using methods such as Quantile Transformation, one can improve model performance by reducing the impact of skewed target distributions. The scatter plots highlight the significance of careful preprocessing for precise and trustworthy regression modeling by illustrating how target transformations affect linear regression predictions.