Code Implementation of Feature Selection Using Random Forest Classifier
Step 1: Import Necessary Libraries
We import essential libraries for data manipulation, model building, and visualization.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Generate Synthetic Dataset
We generate a synthetic dataset with 1,000 samples and 10 features: 5 are informative, 2 are redundant (linear combinations of the informative ones), and the remaining 3 are random noise.
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42)
# Convert to DataFrame for convenience
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
data = pd.DataFrame(X, columns=feature_names)
data['target'] = y
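Before moving on, it can help to confirm that the generated data looks as expected. The following is a minimal sketch that repeats the generator call above and checks the shape, the class balance, and the absence of missing values; the variable names `class_counts` and `n_missing` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same generator call as in Step 2
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
data = pd.DataFrame(X, columns=feature_names)
data['target'] = y

# Expect 1000 rows and 11 columns (10 features plus the target)
print(data.shape)

# Class balance: with default settings the two classes are roughly equal in size
class_counts = data['target'].value_counts()
print(class_counts)

# Generated data should contain no missing values
n_missing = int(data.isna().sum().sum())
print(f'Missing values: {n_missing}')
```

Roughly balanced classes matter here because plain accuracy, the metric used in the later steps, is only a fair summary when neither class dominates.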
Step 3: Separate Features and Target Variable
We separate the features and the target variable for model training and evaluation.
# Separate features and target variable
X = data.drop('target', axis=1)
y = data['target']
Step 4: Split the Dataset into Training and Test Sets
We hold out part of the data so that accuracy is measured on samples the model has not seen. The 30% test size is inferred from the reported accuracies (0.9433 corresponds to 283 of 300 test samples); the random seed is an assumption.
# Split into training and test sets (test_size inferred from the reported results; random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 5: Train Random Forest Classifier and Calculate Initial Accuracy
We train a Random Forest classifier on the training set and evaluate its accuracy on the test set.
# Initialize and train the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Make predictions and calculate accuracy before feature selection
y_pred = rf.predict(X_test)
initial_accuracy = accuracy_score(y_test, y_pred)
Step 6: Get and Visualize Feature Importances
We extract feature importances from the trained model and visualize them using a bar plot.
# Get feature importances
feature_importances = rf.feature_importances_
# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': feature_importances
})
# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.show()
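Output: [Figure: horizontal bar plot titled "Feature Importance", with Importance on the x-axis and Feature on the y-axis, sorted from most to least important]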
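The importances plotted above are impurity-based: they are computed from the training data and are known to favor features with many distinct values. As a cross-check, the sketch below compares them with permutation importance measured on the held-out test set, using scikit-learn's `permutation_importance`. The 70/30 split, its seed, and `n_repeats=10` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Recreate the dataset and model from the earlier steps
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

# Assumed split: 30% test set, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature in turn on the test set and record the drop in accuracy
result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=42)

# Put both rankings side by side
perm_df = pd.DataFrame({
    'Feature': X_test.columns,
    'Impurity importance': rf.feature_importances_,
    'Permutation importance': result.importances_mean,
    'Permutation std': result.importances_std,
}).sort_values(by='Permutation importance', ascending=False)
print(perm_df)
```

If both rankings agree on the top features, the selection in the next step rests on firmer ground. A feature that ranks high on impurity but near zero on permutation importance deserves a closer look.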
Step 7: Select Top Features
We select the top 5 features based on their importance scores and create new datasets with these selected features.
# Select top 5 features (as an example)
top_features = feature_importance_df.head(5)['Feature'].values
# Create a new dataset with only the top features
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
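The manual top-5 selection above can also be expressed with scikit-learn's `SelectFromModel`, which wraps the same idea in a reusable transformer. This is a sketch rather than the article's method: setting `threshold=-np.inf` together with `max_features=5` tells the selector to keep exactly the five highest-ranked features, and the 70/30 split with a fixed seed is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Recreate the dataset from the earlier steps
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

# Assumed split: 30% test set, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# threshold=-np.inf disables the importance cutoff, so only max_features applies
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=-np.inf,
    max_features=5,
)
selector.fit(X_train, y_train)

# Boolean mask of kept columns, mapped back to column names
selected_columns = list(X_train.columns[selector.get_support()])
print(selected_columns)

# transform() drops the unselected columns from both sets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)
```

The selector is fitted on the training set only, so the test set plays no part in deciding which features are kept.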
Step 8: Train Classifier with Selected Features and Calculate Accuracy
We train a new Random Forest classifier using the selected features and evaluate its accuracy on the test set.
# Train the classifier with selected features
rf_selected = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selected.fit(X_train_selected, y_train)
# Make predictions and calculate accuracy after feature selection
y_pred_selected = rf_selected.predict(X_test_selected)
selected_accuracy = accuracy_score(y_test, y_pred_selected)
print(f'Accuracy before feature selection: {initial_accuracy:.4f}')
print(f'Accuracy after feature selection: {selected_accuracy:.4f}')
Output:
Accuracy before feature selection: 0.9400
Accuracy after feature selection: 0.9433
The output shows how feature selection with a Random Forest classifier behaves on this synthetic dataset. The model trained on all 10 features reached 94.00% accuracy on the test set, and the model retrained on only the 5 highest-ranked features reached 94.33%. The gain of 0.33 percentage points is small (a single test sample, if the test set holds 300 samples), so it is better read as "no loss in accuracy" than as a real improvement. The practical result is that half of the features can be dropped without hurting predictive performance, which is expected here because 3 of the 10 features carry no signal and 2 are redundant. A model with fewer features is also cheaper to train and easier to interpret.
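A single train/test split gives only one accuracy figure per model, so small differences are hard to interpret. One way to get a steadier comparison is cross-validation, sketched below. The 5-fold setup, the seeds, and the use of a `Pipeline` with `SelectFromModel` are assumptions for illustration; the pipeline repeats the feature selection inside each fold so that the held-out fold never influences which features are chosen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Same synthetic dataset as in Step 2
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)

# Assumed evaluation scheme: 5 stratified folds with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Model A: Random Forest on all 10 features
full_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Model B: keep the 5 most important features, then fit a Random Forest on them
selected_model = Pipeline([
    ('select', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold=-np.inf, max_features=5)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores_full = cross_val_score(full_model, X, y, cv=cv, scoring='accuracy')
scores_selected = cross_val_score(selected_model, X, y, cv=cv, scoring='accuracy')

print(f'All features:   {scores_full.mean():.4f} +/- {scores_full.std():.4f}')
print(f'Top 5 features: {scores_selected.mean():.4f} +/- {scores_selected.std():.4f}')
```

If the two means differ by less than the fold-to-fold spread, the models are best treated as equally accurate, and the case for the smaller model rests on simplicity and speed.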
Feature Selection Using Random Forest Classifier
Feature selection is a crucial step in the machine learning pipeline that involves identifying the most relevant features for building a predictive model. One effective method for feature selection is using a Random Forest classifier, which provides insights into feature importance. In this article, we will explore how to use a Random Forest classifier for feature selection, understand its benefits, and go through a practical example using Python.