Steps to Perform PCA for Removing Multicollinearity
The main steps needed to implement PCA are as follows:
- Mean-center the data, and standardize it when the features are on different scales, so that each feature contributes equally to the analysis.
- Compute the covariance matrix to see how the features vary from their means with respect to each other (that is, to identify correlation or inverse correlation).
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components (PCs). The eigenvectors give the directions of the new axes, ordered by how much variance they capture, and the eigenvalues give the amount of variance along each PC.
- Form the feature vector by keeping the eigenvectors with the largest eigenvalues and discarding those with low eigenvalues (less significant).
- Reorient the data based on the feature vector (from the original axes to the principal components).
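The steps above can be sketched by hand with NumPy. This is a minimal illustration and not how scikit-learn implements PCA internally. The synthetic dataset, the choice of k = 2, and all variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Step 1: mean-center each feature.
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix of the features (columns).
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigendecomposition. eigh suits symmetric matrices and returns
# eigenvalues in ascending order, so sort them in descending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the k eigenvectors with the largest eigenvalues.
k = 2
feature_vector = eigvecs[:, :k]

# Step 5: project the centered data onto the kept components.
X_projected = X_centered @ feature_vector

# The covariance of the projected features should be (numerically) diagonal,
# which means the new features are uncorrelated.
projected_cov = np.cov(X_projected, rowvar=False)
print(np.round(projected_cov, 6))
```

The off-diagonal entries of the printed matrix should be approximately zero, which is the property that makes PCA useful against multicollinearity.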
1. Implementing PCA to Remove Multicollinearity
Scikit-learn provides a handy PCA class, so we don’t need to implement the above steps ourselves. Let’s apply principal component analysis (PCA) to the iris dataset.
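from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# load the iris dataset (skip this if it is already loaded)
iris = load_iris()
# applying PCA
pca_iris = PCA(n_components=3)
X_pca = pca_iris.fit_transform(iris.data)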
Scikit-learn’s PCA class uses singular value decomposition (SVD) to implement PCA. Since we set n_components to 3, PCA creates 3 new features, each a linear combination of the 4 original features.
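To make the "linear combination" statement concrete, the following sketch inspects the fitted weights and rebuilds the transformed data by hand. It also checks that the new features are uncorrelated, since that is the point of using PCA here. It reloads the iris data so it can run on its own. The names X_manual and corr_pca are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca_iris = PCA(n_components=3)
X_pca = pca_iris.fit_transform(iris.data)

# Each row of components_ holds the weights of one principal component
# over the 4 original features.
print(pca_iris.components_.shape)  # (3, 4)

# Projecting the mean-centered data onto those weights should reproduce
# the output of fit_transform (up to floating-point error).
X_manual = (iris.data - pca_iris.mean_) @ pca_iris.components_.T
print(np.allclose(X_manual, X_pca))

# Pairwise correlations between the new features. The off-diagonal
# entries should be close to zero.
corr_pca = np.corrcoef(X_pca, rowvar=False)
print(np.round(corr_pca, 3))
```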
Let’s plot the irises across the three PCA dimensions. The code is as follows:
import matplotlib.pyplot as plt
fig = plt.figure(1, figsize=(7, 6))
ax = fig.add_subplot(111, projection="3d", elev=-155, azim=112)
# scatter plot of the data projected onto the first three PCs
sctr = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2],
                  c=iris.target, s=45)
ax.legend(sctr.legend_elements()[0], iris.target_names,
          loc="upper right", title="Classes")
ax.set_xlabel("PC1 (1st Eigenvector)")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("PC2 (2nd Eigenvector)")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("PC3 (3rd Eigenvector)")
ax.zaxis.set_ticklabels([])
plt.show()
Output:
Here we plotted the iris samples along the three principal components using Matplotlib’s scatter() method.
We can check the explained variance ratio of each principal component.
exp_var_ratio = pca_iris.explained_variance_ratio_
print(exp_var_ratio)
Output:
[0.92461872 0.05306648 0.01710261]
From the output, we can conclude that about 92.5% of the dataset’s variance lies along the first PC, 5.3% lies along the 2nd PC, and 1.7% lies along the third PC. Compared with the first PC, the 2nd and 3rd PCs carry very little information.
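One way to turn these ratios into a decision about how many components to keep is to look at their cumulative sum. The sketch below is only an illustration: the 0.95 threshold is an arbitrary assumption, and the names pca_full, cumulative, and n_keep do not appear in the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Fit PCA with all 4 components to see the full variance breakdown.
pca_full = PCA().fit(iris.data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(np.round(cumulative, 4))

# Smallest number of components whose cumulative ratio reaches the
# threshold (0.95 is an illustrative choice, not a rule).
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(n_keep)
```

With the ratios reported above, the first two components together explain roughly 97.8% of the variance, so two components are enough to pass a 0.95 threshold.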
Let’s plot the explained variance ratio as a bar graph for each principal component.
plt.figure(figsize=(6, 4))
plt.bar(range(3), exp_var_ratio, alpha=0.8,
        align='center')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.tight_layout()
plt.show()
Output: (bar chart of the explained variance ratio for each principal component)
2. Training Logistic Regression with PCA
We have already applied PCA to the iris dataset. Now we can train a logistic regression model on the result. Let’s split the PCA-transformed iris dataset into training and test sets. The code is as follows:
from sklearn.model_selection import train_test_split
# split training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.3,
    random_state=20, stratify=iris.target)
Here we used the train_test_split() function from scikit-learn to split the dataset into a training set and a test set. Now we use the training set to fit a logistic regression model. The code is as follows:
from sklearn.linear_model import LogisticRegression
# logistic regression model
log = LogisticRegression()
log.fit(X_train, y_train)
With the model trained, let’s check its accuracy on the test data.
from sklearn.metrics import accuracy_score
# predict using test data
prediction = log.predict(X_test)
# calculate score using accuracy metric (true labels first, then predictions)
ac_score = accuracy_score(y_test, prediction)
print('The accuracy score:', ac_score)
Output:
The accuracy score: 0.9777777777777777
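The workflow above fits PCA on the full dataset before splitting it. A common alternative, sketched here, is to split first and wrap the steps in a scikit-learn pipeline so that the PCA transformation is learned from the training data only. This is an optional variation, not the method used in this article. The StandardScaler step, the max_iter value, and the names model and pipeline_score are assumptions added for the illustration, so the resulting score may differ from the one reported above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Split the original features first, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3,
    random_state=20, stratify=iris.target)

# Scaling, PCA, and the classifier are all fitted on the training data only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# score() applies the fitted scaler and PCA to X_test before predicting.
pipeline_score = model.score(X_test, y_test)
print('Pipeline accuracy score:', pipeline_score)
```

Keeping the transformation inside the pipeline also means that new samples can be passed to model.predict() in their original four-feature form.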
Applying PCA to Logistic Regression to Remove Multicollinearity
Multicollinearity is a common issue in regression models, where predictor variables are highly correlated. This can lead to unstable estimates of regression coefficients, making it difficult to determine the effect of each predictor on the response variable. Principal Component Analysis (PCA) is a powerful technique to address this issue by transforming the original correlated variables into a set of uncorrelated variables called principal components. This article explores how PCA can be applied to logistic regression to remove multicollinearity and improve model performance.
Table of Contents
- Understanding Multicollinearity
- Principal Component Analysis (PCA) for Multicollinearity
- Detecting and Visualizing Multicollinearity
- Visualizing Correlation with a Scatter Plot Diagram
- Calculating the Correlation Value
- Steps to Perform PCA for Removing Multicollinearity
- 1. Implementing PCA to Remove Multicollinearity
- 2. Training Logistic Regression with PCA