Feature agglomeration vs. univariate selection using scikit-learn

1. Import Libraries:

The required libraries are imported here:

  • load_iris: a function that loads the Iris dataset.
  • SelectKBest: a class for univariate feature selection that keeps the k highest-scoring features.
  • f_classif: a function that computes the ANOVA F-value between each feature and the class labels.
  • FeatureAgglomeration: a class that merges similar features using hierarchical clustering.


from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import FeatureAgglomeration

2. Load the Iris Dataset:

After loading the Iris dataset, the feature matrix is stored in X and the target labels in y.


iris = load_iris()
X, y = iris.data, iris.target

3. Feature Agglomeration:

Feature agglomeration is used to lower the dataset’s dimensionality. With n_clusters set to 2, the algorithm groups the four features into two clusters and replaces each cluster with a single pooled feature (the mean of its members by default). X_reduced contains the transformed data.


agglomeration = FeatureAgglomeration(n_clusters=2)
X_reduced = agglomeration.fit_transform(X)
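
To make the grouping concrete, the following minimal sketch inspects which original features end up in which cluster and checks how the reduced columns relate to them. It assumes the default pooling_func (np.mean) and that output column j corresponds to cluster label j; the variable names members and pooled are illustrative only.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

agglomeration = FeatureAgglomeration(n_clusters=2)
X_reduced = agglomeration.fit_transform(X)

# labels_[i] is the cluster index assigned to original feature i.
for name, label in zip(iris.feature_names, agglomeration.labels_):
    print(f"{name} -> cluster {label}")

# With the default pooling_func (np.mean), each output column should equal
# the per-sample mean of the features that belong to that cluster.
checks = []
for cluster in range(agglomeration.n_clusters):
    members = agglomeration.labels_ == cluster
    pooled = X[:, members].mean(axis=1)
    checks.append(np.allclose(X_reduced[:, cluster], pooled))
    print(f"cluster {cluster}: matches mean of members = {checks[-1]}")
```

Because the new columns are averages of several measurements, they no longer correspond to any single original feature.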

4. Univariate Selection:

Univariate feature selection is applied using the ANOVA F-value as the scoring function. With k=2, only the two highest-scoring features are kept. X_k_best holds the selected columns of the data.


k_best = SelectKBest(f_classif, k=2)
X_k_best = k_best.fit_transform(X, y)
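
To see which features were kept and why, the fitted selector can be inspected. The sketch below prints the per-feature F-values and the names of the selected columns; the variable names mask and selected are illustrative and not part of the steps above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

k_best = SelectKBest(f_classif, k=2)
X_k_best = k_best.fit_transform(X, y)

# scores_ holds one ANOVA F-value per original feature; a higher value means
# the class means differ more relative to the within-class spread.
for name, score in zip(iris.feature_names, k_best.scores_):
    print(f"{name}: F = {score:.1f}")

# get_support() returns a boolean mask over the original columns.
mask = k_best.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected:", selected)
```

Unlike feature agglomeration, the selected columns are unchanged copies of original features, so they remain directly interpretable.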

5. Display the Results:


print("Original Shape:", X.shape)
print("Agglomerated Shape:", X_reduced.shape)
print("Univariate Selection Shape:", X_k_best.shape)


Original Shape: (150, 4)
Agglomerated Shape: (150, 2)
Univariate Selection Shape: (150, 2)

6. Train a model on the feature-agglomerated dataset


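from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# DecisionTreeClassifier trained on the agglomerated features.
# Note: the argument list is truncated in the source after criterion='entropy';
# any remaining arguments are unknown, so only the call is closed here.
tree_clf = DecisionTreeClassifier(criterion='entropy')
tree_clf.fit(X_reduced, y)
# Predictions are made on the same data the tree was trained on.
pred = tree_clf.predict(X_reduced)
print(classification_report(y, pred, target_names=iris.target_names))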


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.96      0.88      0.92        50
   virginica       0.89      0.96      0.92        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150

7. Train a model on the univariate-selected dataset


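# DecisionTreeClassifier trained on the features kept by univariate selection.
# Note: the argument list is truncated in the source after criterion='entropy';
# any remaining arguments are unknown, so only the call is closed here.
from sklearn.metrics import classification_report
tree_clf = DecisionTreeClassifier(criterion='entropy')
tree_clf.fit(X_k_best, y)
# Predictions are made on the same data the tree was trained on.
pred = tree_clf.predict(X_k_best)
print(classification_report(y, pred, target_names=iris.target_names))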


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.91      0.98      0.94        50
   virginica       0.98      0.90      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150

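As the reports above show, univariate feature selection performed slightly better than feature agglomeration here (0.96 vs. 0.95 accuracy). Both models are scored on the same data they were trained on, so a one-point gap is too small to generalize from.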
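
The comparison above evaluates each tree on its own training data. One way to get a less optimistic estimate is cross-validation with each reduction step wrapped in a pipeline, so that it is refit on every training fold. The sketch below is one possible setup, not the procedure used above: cv=5, random_state=0, and the names pipelines and results are assumptions made for illustration, and the scores it prints will not match the reports shown earlier.

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each pipeline refits its reduction step inside every training fold,
# so no information from the held-out fold leaks into feature choice.
# random_state=0 and cv=5 are illustrative choices.
pipelines = {
    "Feature agglomeration": make_pipeline(
        FeatureAgglomeration(n_clusters=2),
        DecisionTreeClassifier(criterion="entropy", random_state=0),
    ),
    "Univariate selection": make_pipeline(
        SelectKBest(f_classif, k=2),
        DecisionTreeClassifier(criterion="entropy", random_state=0),
    ),
}

results = {}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5)  # accuracy per fold
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two mean accuracies fall within roughly one standard deviation of each other, the methods are best treated as performing comparably on this dataset.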

Reducing the number of input features is a crucial stage in many machine learning workflows. Univariate selection keeps the original features that score highest on a statistical test, while feature agglomeration builds new features by merging similar ones; both are available in scikit-learn. These techniques reduce dimensionality, can make models more efficient, and may improve model performance.

