Feature agglomeration vs. univariate selection using Scikit Learn
1. Import Libraries:
The required libraries are imported here:
- load_iris: a function that loads the Iris dataset.
- SelectKBest: a class for univariate feature selection that keeps the k highest-scoring features.
- f_classif: a function that computes the ANOVA F-value between each feature and the target.
- FeatureAgglomeration: a class that performs feature agglomeration by merging similar features into clusters.
Python3
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import FeatureAgglomeration
2. Load the Iris Dataset:
The Iris dataset is loaded; its features are stored in X and its target labels in y.
Python3
iris = load_iris()
X, y = iris.data, iris.target
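To make the later steps concrete, it can help to look at what the loaded dataset actually contains. The following sketch (not part of the original walkthrough) prints the four feature names and three class names that X and y refer to:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# X holds 150 samples with 4 measurements each; y holds the class label (0-2)
print(iris.feature_names)          # sepal/petal length and width, in cm
print(list(iris.target_names))     # ['setosa', 'versicolor', 'virginica']
print(X.shape, y.shape)
```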
3. Feature Agglomeration:
Feature agglomeration is used to reduce the dataset's dimensionality. With n_clusters set to 2, the algorithm merges the four features into two clusters and replaces each cluster with a pooled value (by default, the mean of its features). The transformed data is stored in X_reduced.
Python3
agglomeration = FeatureAgglomeration(n_clusters=2)
X_reduced = agglomeration.fit_transform(X)
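After fitting, the labels_ attribute shows which cluster each original feature was assigned to, which makes the "two clusters of features" idea tangible. A minimal sketch (the loop and its printout are an addition, not part of the original article):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import FeatureAgglomeration

iris = load_iris()
X = iris.data

agglomeration = FeatureAgglomeration(n_clusters=2)
X_reduced = agglomeration.fit_transform(X)

# labels_ assigns each of the 4 original features to one of the 2 clusters;
# each output column of X_reduced is the mean of the features in its cluster.
for name, label in zip(iris.feature_names, agglomeration.labels_):
    print(f"{name} -> cluster {label}")
```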
4. Univariate Selection:
Univariate feature selection is applied using the ANOVA F-value as the scoring function. With k=2, only the two highest-scoring features are kept. The transformed data is stored in X_k_best.
Python3
k_best = SelectKBest(f_classif, k=2)
X_k_best = k_best.fit_transform(X, y)
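It is also possible to see which two features were selected and why: after fitting, scores_ holds the ANOVA F-value of each feature and get_support() marks the k highest-scoring ones. The inspection loop below is an added sketch, not part of the original article; on the Iris data, the two petal measurements score far higher than the sepal ones.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

k_best = SelectKBest(f_classif, k=2)
X_k_best = k_best.fit_transform(X, y)

# scores_ gives each feature's ANOVA F-value; get_support() flags the k kept ones
for name, score, kept in zip(iris.feature_names, k_best.scores_, k_best.get_support()):
    print(f"{name}: F={score:.1f}, kept={kept}")
```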
5. Display the Results:
Python3
print("Original Shape:", X.shape)
print("Agglomerated Shape:", X_reduced.shape)
print("Univariate Selection Shape:", X_k_best.shape)
Output:
Original Shape: (150, 4)
Agglomerated Shape: (150, 2)
Univariate Selection Shape: (150, 2)
6. Train a Model on the Agglomerated Dataset:
A decision tree classifier is trained on X_reduced and evaluated on the same data.
Python3
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Decision tree on the agglomerated features
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree_clf.fit(X_reduced, y)
pred = tree_clf.predict(X_reduced)
print(classification_report(y, pred, target_names=iris.target_names))
Output:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.96      0.88      0.92        50
   virginica       0.89      0.96      0.92        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150
7. Train a Model on the Univariate Selection Dataset:
The same decision tree classifier is trained on X_k_best and evaluated on the same data.
Python3
# Decision tree on the univariate-selected features
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree_clf.fit(X_k_best, y)
pred = tree_clf.predict(X_k_best)
print(classification_report(y, pred, target_names=iris.target_names))
Output:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.91      0.98      0.94        50
   virginica       0.98      0.90      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150
As the reports above show, univariate feature selection (accuracy 0.96) performed slightly better than feature agglomeration (accuracy 0.95) on this dataset. Note that both models were evaluated on the same data they were trained on, so these figures are optimistic.
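A fairer comparison uses held-out data. The sketch below (an addition, not from the original article) compares the two reduced datasets with 5-fold cross-validation; the specific scores will differ somewhat from the training-set figures above. Strictly, the reducers are fitted on the full data here, so a small amount of leakage remains; wrapping each reducer and the classifier in a Pipeline would avoid that.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Build both reduced datasets as in the steps above
X_reduced = FeatureAgglomeration(n_clusters=2).fit_transform(X)
X_k_best = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Same tree as above; random_state fixed for reproducibility
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)

acc_agg = cross_val_score(clf, X_reduced, y, cv=5).mean()
acc_uni = cross_val_score(clf, X_k_best, y, cv=5).mean()
print("Agglomeration CV accuracy:", acc_agg)
print("Univariate   CV accuracy:", acc_uni)
```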