Methods to Calculate Feature Importance
There are several methods to calculate feature importance, each with its own advantages and applications. Here, we will explore some of the most common methods used in tree-based models.
1. Decision Tree Feature Importance
Decision trees, such as Classification and Regression Trees (CART), calculate feature importance based on the reduction in a criterion (e.g., Gini impurity or entropy) used to select split points. The importance score for each feature is the total, sample-weighted reduction of the criterion brought by all splits on that feature, typically normalized so the scores sum to 1.
Example: DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot as plt
# Define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = DecisionTreeClassifier()
model.fit(X, y)
# Get importance
importance = model.feature_importances_
# Summarize feature importance
for i, v in enumerate(importance):
    print(f'Feature: {i}, Score: {v:.5f}')
# Plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()
Output:
Feature: 0, Score: 0.01078
Feature: 1, Score: 0.01851
Feature: 2, Score: 0.18831
Feature: 3, Score: 0.30516
Feature: 4, Score: 0.08657
Feature: 5, Score: 0.00733
Feature: 6, Score: 0.18437
Feature: 7, Score: 0.02780
Feature: 8, Score: 0.12904
Feature: 9, Score: 0.04215
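Under the hood, feature_importances_ is the normalized sum of the sample-weighted impurity decreases produced by every split on a feature. As a sanity check, the score can be reproduced from the fitted tree's internals. This is a minimal sketch, relying on scikit-learn's tree_ attribute; the variable names are our own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)

tree = model.tree_
importances = np.zeros(X.shape[1])
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:
        continue  # leaf node: no split, contributes nothing
    # sample-weighted impurity decrease produced by this split
    decrease = (tree.weighted_n_node_samples[node] * tree.impurity[node]
                - tree.weighted_n_node_samples[left] * tree.impurity[left]
                - tree.weighted_n_node_samples[right] * tree.impurity[right])
    importances[tree.feature[node]] += decrease

importances /= importances.sum()  # normalize so scores sum to 1
print(np.allclose(importances, model.feature_importances_))  # True
```

The hand-computed vector matches model.feature_importances_, which confirms that the score is nothing more than accumulated impurity reduction.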
2. Random Forest Feature Importance
Random forests are ensembles of decision trees. They calculate feature importance by averaging the importance scores of each feature across all the trees in the forest.
Example: RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = RandomForestClassifier()
model.fit(X, y)
# Get importance
importance = model.feature_importances_
# Summarize feature importance
for i, v in enumerate(importance):
    print(f'Feature: {i}, Score: {v:.5f}')
# Plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()
Output:
Feature: 0, Score: 0.06806
Feature: 1, Score: 0.10468
Feature: 2, Score: 0.15456
Feature: 3, Score: 0.20209
Feature: 4, Score: 0.08275
Feature: 5, Score: 0.09979
Feature: 6, Score: 0.10596
Feature: 7, Score: 0.04535
Feature: 8, Score: 0.09206
Feature: 9, Score: 0.04471
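Because the forest score is the average of the per-tree scores, you can recover it by hand and, as a bonus, look at the spread across trees, which hints at how stable each ranking is. A small sketch (the "+/-" display is our own addition, not standard scikit-learn output):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# Stack the per-tree importance vectors: one row per tree in the ensemble
per_tree = np.array([t.feature_importances_ for t in model.estimators_])
mean = per_tree.mean(axis=0)   # this is what feature_importances_ returns
std = per_tree.std(axis=0)     # spread across trees: a rough stability signal

for i, (m, s) in enumerate(zip(mean, std)):
    print(f'Feature: {i}, Score: {m:.5f} +/- {s:.5f}')
```

A feature with a high mean but also a high standard deviation is ranked inconsistently by individual trees, so its importance should be interpreted with more caution.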
3. Permutation Feature Importance
Permutation feature importance involves shuffling the values of each feature and measuring the decrease in model performance. This method can be applied to any machine learning model, not just tree-based models.
Example: permutation_importance
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Perform permutation importance
results = permutation_importance(model, X_test, y_test, scoring='accuracy')
# Summarize permutation importance scores
for i, v in enumerate(results.importances_mean):
    print(f'Feature: {i}, Score: {v:.5f}')
# Plot permutation importance
plt.bar(range(len(results.importances_mean)), results.importances_mean)
plt.show()
Output:
Feature: 0, Score: 0.00800
Feature: 1, Score: 0.06200
Feature: 2, Score: 0.12000
Feature: 3, Score: 0.10200
Feature: 4, Score: -0.00100
Feature: 5, Score: 0.03600
Feature: 6, Score: 0.01800
Feature: 7, Score: 0.00300
Feature: 8, Score: 0.03500
Feature: 9, Score: -0.00500
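To demystify what permutation_importance computes, the same idea can be sketched by hand: shuffle one column of the test set, re-score the model, and record the drop in accuracy. This is a simplified single-pass version (scikit-learn repeats each shuffle n_repeats times, 5 by default, and averages; the fixed seeds here are our own choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

rng = np.random.default_rng(1)
baseline = model.score(X_test, y_test)   # accuracy on the intact test set
drops = []
for i in range(X_test.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, i])            # break feature i's link to the target
    drops.append(baseline - model.score(X_perm, y_test))
    print(f'Feature: {i}, Score: {drops[-1]:.5f}')
```

A large drop means the model relied on that feature; a drop near zero (or slightly negative, as for features 4 and 9 above) means shuffling it barely hurt, so the model is not using it.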
Understanding Feature Importance and Visualization of Tree Models
Feature importance is a crucial concept in machine learning, particularly in tree-based models. It refers to techniques that assign a score to input features based on their usefulness in predicting a target variable. This article will delve into the methods of calculating feature importance, the significance of these scores, and how to visualize them effectively.
Table of Contents
- Feature Importance in Tree Models
- Methods to Calculate Feature Importance
- 1. Decision Tree Feature Importance
- 2. Random Forest Feature Importance
- 3. Permutation Feature Importance
- Demonstrating Visualization of Tree Models
- Yellowbrick for Visualization of Tree Models