Random Forest Classifier in Machine Learning

Step 1: Loading dataset


# importing required libraries
# importing Scikit-learn library and datasets package
from sklearn import datasets
# Loading the iris plants dataset (classification)
iris = datasets.load_iris()

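Step 2: Checking the dataset's target classes and feature names


# printing the names of the target classes (species)
print(iris.target_names)


['setosa' 'versicolor' 'virginica']


# printing the names of the four feature columns
print(iris.feature_names)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']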

Step 3: Train Test Split


# loading the features (X) and the target labels (y) as separate arrays
X, y = datasets.load_iris(return_X_y=True)
# splitting arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split
# i.e. 70 % training data and 30 % test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
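
Because train_test_split shuffles the rows at random, every run produces a different split and therefore a slightly different accuracy. A minimal sketch of a repeatable split follows; the seed value 42 and the stratify argument are assumptions added for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state=42 is an arbitrary seed chosen for illustration;
# stratify=y keeps the three species equally represented in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters more on small or imbalanced datasets, where a purely random split can leave one class underrepresented in the test set.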

Step 4: Import Random Forest Classifier module.


# importing random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# creating dataframe of IRIS dataset
data = pd.DataFrame({'sepallength': iris.data[:, 0], 'sepalwidth': iris.data[:, 1],
                     'petallength': iris.data[:, 2], 'petalwidth': iris.data[:, 3],
                     'species': iris.target})

Overview of the Dataset


# printing the first 5 rows of the iris dataframe
print(data.head())


   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0
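
The integer codes in the species column are easier to read once they are mapped back to species names. The sketch below adds a column called species_name, a hypothetical name chosen for illustration and not part of the original code, then checks how many samples each class has and how the petal measurements differ between species.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame({
    'sepallength': iris.data[:, 0], 'sepalwidth': iris.data[:, 1],
    'petallength': iris.data[:, 2], 'petalwidth': iris.data[:, 3],
    'species': iris.target,
})

# species_name is an illustrative column: class index -> species name
data['species_name'] = iris.target_names[iris.target]

# number of samples per species
counts = data['species_name'].value_counts()
print(counts)

# average petal measurements per species
means = data.groupby('species_name')[['petallength', 'petalwidth']].mean()
print(means)
```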

Step 5: Training of Model


# creating a RF classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
# metrics are used to find accuracy or error
from sklearn import metrics
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


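ACCURACY OF THE MODEL: 0.9238095238095239

The exact value changes from run to run because both the train/test split and the forest are randomized. Note that 0.9238... equals 97/105, which a 45-sample test set (test_size = 0.30) cannot produce, so this output most likely comes from a run with test_size = 0.70.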
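
A single accuracy figure hides which species get confused with one another. The following is a minimal sketch that prints a confusion matrix alongside the accuracy; the fixed seeds (random_state=0) are an assumption added here so that the run is repeatable, and they are not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# accuracy is the diagonal of the matrix divided by the number of test samples
acc = accuracy_score(y_test, y_pred)
print("accuracy:", acc)
```

Off-diagonal entries show misclassifications; on Iris these typically fall between versicolor and virginica, whose measurements overlap.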

Step 6: Predictions


# predicting which type of flower it is.
clf.predict([[3, 3, 2, 2]])



The classifier returns a class index, which maps to a species name through iris.target_names. Here the predicted class is setosa; the dataset contains three species (classes): setosa, versicolor, and virginica.
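
To turn the numeric prediction into a readable label and see how confident the forest is, predict can be combined with predict_proba. This is a sketch under assumptions: the model is trained on the full dataset with a fixed seed purely for illustration, and the predicted species may differ from run to run in the original, unseeded code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# trained on all 150 samples with a fixed seed, for illustration only
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# sepal length, sepal width, petal length, petal width (cm)
sample = [[3, 3, 2, 2]]

pred = clf.predict(sample)               # array holding one class index
species = iris.target_names[pred[0]]     # class index -> species name
proba = clf.predict_proba(sample)[0]     # averaged class probabilities over the trees

print("predicted species:", species)
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")
```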

Check the important features

Next, we rank the features of the Iris dataset by how much each one contributes to the forest's decisions, using the following lines of code.


# ranking features by the fitted model's feature_importances_ attribute
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
print(feature_imp)


petal length (cm)    0.440050
petal width (cm)     0.423437
sepal length (cm)    0.103293
sepal width (cm)     0.033220
dtype: float64
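
The feature_importances_ attribute is computed from impurity reductions on the training data. As a cross-check, permutation importance measures how much the test accuracy drops when one feature's values are shuffled. The sketch below uses sklearn.inspection.permutation_importance; the seeds and n_repeats=10 are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# shuffle each feature 10 times on the held-out data and record the accuracy drop
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

perm_imp = pd.Series(result.importances_mean, index=iris.feature_names)
perm_imp = perm_imp.sort_values(ascending=False)
print(perm_imp)
```

When two features are strongly correlated, as petal length and petal width are, either method can split the credit between them, so the rankings are best read as a rough guide.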

Random Forests in Python’s Scikit-Learn library come with a set of hyperparameters that allow you to fine-tune the behavior of the model. Understanding and selecting appropriate hyperparameters is crucial for optimizing model performance.

Random Forest Classifier Parameters

  • n_estimators: Number of trees in the forest.
    • More trees generally lead to better performance, but at the cost of computational time.
    • Start with a value of 100 and increase as needed.
  • max_depth: Maximum depth of each tree.
    • Deeper trees can capture more complex patterns, but also risk overfitting.
    • Experiment with values between 5 and 15, and consider lower values for smaller datasets.
  • max_features: Number of features considered for splitting at each node.
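    • A common value is 'sqrt' (square root of the total number of features), which is also the default for classification.
    • Smaller values make the trees less correlated with each other; larger values let each split choose from more features.
  • criterion: Function used to measure split quality ('gini' or 'entropy').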
    • Gini impurity is often slightly faster, but both are generally similar in performance.
  • min_samples_split: Minimum samples required to split a node.
    • Higher values can prevent overfitting, but too high can hinder model complexity.
    • Start with 2 and adjust as needed.
  • min_samples_leaf: Minimum samples required to be at a leaf node.
    • Similar to min_samples_split, but focused on leaf nodes.
    • Start with 1 and adjust as needed.
  • bootstrap: Whether to use bootstrap sampling when building trees (True or False).
    • Bootstrapping makes the trees less correlated, which reduces the variance of the ensemble and usually improves generalization, at the cost of a slight increase in bias.
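
Rather than adjusting these parameters by hand, a small cross-validated grid search can compare combinations systematically. The following is a sketch only: the grid values, cv=3, and the seed are assumptions picked to keep the example fast, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# a deliberately small, illustrative grid (8 combinations)
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

On a dataset as small and clean as Iris, most combinations score similarly, so the search mainly demonstrates the workflow.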

Advantages of Random Forest Classifier

  • The ensemble nature of Random Forests, combining multiple trees, makes them less prone to overfitting compared to individual decision trees.
  • Random Forests remain effective on datasets with a large number of features and handle irrelevant variables well.
  • Random Forests can provide insights into feature importance, helping in feature selection and understanding the dataset.
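
One way to check the overfitting claim on a given dataset is to compare a single decision tree with a forest under the same cross-validation. This is a minimal sketch with assumed settings (cv=5, fixed seeds); on a small, easy dataset such as Iris the gap may be small or absent.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for a single tree and for a forest of 100 trees
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print("single tree  :", tree_scores.mean())
print("random forest:", forest_scores.mean())
```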

Disadvantages of Random Forest Classifier

  • Random Forests can be computationally expensive and may require more resources due to the construction of multiple decision trees.
  • The ensemble nature makes it challenging to interpret the reasoning behind individual predictions compared to a single decision tree.
  • In imbalanced datasets, Random Forests may be biased toward the majority class, impacting the predictive performance for minority classes.
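
One common mitigation for class imbalance is the class_weight parameter of RandomForestClassifier. The sketch below builds an artificially imbalanced subset of Iris (an assumption made purely for illustration) and shows the weights that class_weight="balanced" assigns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

X, y = load_iris(return_X_y=True)

# illustrative imbalance: all 50 setosa, all 50 versicolor, only 10 virginica
keep = np.concatenate([
    np.where(y == 0)[0],
    np.where(y == 1)[0],
    np.where(y == 2)[0][:10],
])
X_imb, y_imb = X[keep], y[keep]

# "balanced" weights are n_samples / (n_classes * samples_in_class)
classes = np.unique(y_imb)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imb)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))

# the rare class now counts more heavily when the trees choose their splits
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_imb, y_imb)
```

Whether reweighting helps should be judged with per-class metrics such as recall, since overall accuracy can stay high even when the minority class is mostly misclassified.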

Random Forest Classifier using Scikit-learn

This article shows how to build a Random Forest Classifier with the Scikit-learn library in Python, using the Iris dataset, a small and widely used benchmark for classification.

