Random Forest Classifier in Machine Learning
Step 1: Loading dataset
python3
# Import the datasets package from scikit-learn
from sklearn import datasets

# Load the iris plants dataset (classification)
iris = datasets.load_iris()
Step 2: Checking the dataset's target names and feature names
python3
print(iris.target_names)
Output:
['setosa' 'versicolor' 'virginica']
python3
print(iris.feature_names)
Output:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Step 3: Train Test Split
python3
# Divide the dataset into two parts: a training set and a test set
X, y = datasets.load_iris(return_X_y=True)

# Split arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split

# Use 70% of the data for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
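Because the split above is random, the train and test sets change on every run. The following is a minimal sketch of a reproducible variant, assuming you want fixed results and equal class proportions in both subsets. The `random_state=42` seed and the `stratify=y` argument are additions for illustration and are not part of the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state=42 is an arbitrary seed (assumption); stratify=y keeps the
# three classes in the same proportions in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With 150 samples and `test_size=0.30`, the test set holds 45 rows and the training set 105.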
Step 4: Import Random Forest Classifier module.
python3
# Import the random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Create a DataFrame of the iris dataset
data = pd.DataFrame({
    'sepallength': iris.data[:, 0],
    'sepalwidth': iris.data[:, 1],
    'petallength': iris.data[:, 2],
    'petalwidth': iris.data[:, 3],
    'species': iris.target,
})
Overview of the Dataset
python3
# Print the first 5 rows of the iris DataFrame
print(data.head())
Output:
   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0
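The species column above holds integer codes rather than names. As a small sketch, one way to make the table easier to read is to map those codes to the names in `iris.target_names`. The use of `pd.Categorical.from_codes` here is a choice made for this example, not something the original code does.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build the DataFrame directly from the feature matrix
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Replace integer class codes (0, 1, 2) with the species names
data["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(data.head())
print(data["species"].value_counts())
```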
Step 5: Training of Model
python3
# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training set
clf.fit(X_train, y_train)

# Predict labels for the test set
y_pred = clf.predict(X_test)

# The metrics module provides accuracy and error measures
from sklearn import metrics

print()
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))
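Output (the exact value varies from run to run, because neither the split nor the forest uses a fixed random seed):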
ACCURACY OF THE MODEL: 0.9238095238095239
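A single accuracy figure does not show which classes are being confused. The sketch below, which assumes fixed seeds (`random_state=0`, chosen arbitrarily) so that it can be rerun, prints a confusion matrix and a per-class report using `sklearn.metrics`.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0  # seed is an assumption
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy.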
Step 6: Predictions
python3
# Predict the species of a flower from its four measurements
# (sepal length, sepal width, petal length, petal width)
clf.predict([[3, 3, 2, 2]])
Output:
array([0])
The predicted class index 0 corresponds to setosa, because the three classes in the dataset are indexed in the order setosa (0), versicolor (1), and virginica (2).
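The index returned by `predict` can be turned into a species name, and `predict_proba` shows how the trees voted. This is a minimal sketch that reuses the sample from Step 6. The seeds are arbitrary assumptions, and the printed probabilities depend on them.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0  # seed is an assumption
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

sample = [[3, 3, 2, 2]]  # same measurements as in Step 6

pred = clf.predict(sample)
proba = clf.predict_proba(sample)

# Look up the species name for the predicted class index
print("Predicted species:", iris.target_names[pred[0]])

# Class probabilities, averaged over the trees in the forest
for name, p in zip(iris.target_names, proba[0]):
    print(f"{name}: {p:.2f}")
```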
Check the important features
Next, we find out which features in the iris dataset contribute most to the model's predictions, using the following lines of code.
python3
# Use the fitted model's feature_importances_ attribute
import pandas as pd

feature_imp = pd.Series(
    clf.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
feature_imp
Output:
petal length (cm)    0.440050
petal width (cm)     0.423437
sepal length (cm)    0.103293
sepal width (cm)     0.033220
dtype: float64
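The values above are impurity-based importances computed on the training data. As a cross-check, the sketch below computes permutation importance on the held-out test set with `sklearn.inspection.permutation_importance`. This technique is an addition for comparison and is not used in the original code. The seeds and `n_repeats=10` are arbitrary assumptions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0  # seed is an assumption
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy
result = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=0
)

perm_imp = pd.Series(
    result.importances_mean, index=iris.feature_names
).sort_values(ascending=False)
print(perm_imp)
```

Unlike impurity-based importances, these values are not normalized to sum to 1, and they can be close to zero or slightly negative for features the model barely uses.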
Random Forests in Python’s Scikit-Learn library come with a set of hyperparameters that allow you to fine-tune the behavior of the model. Understanding and selecting appropriate hyperparameters is crucial for optimizing model performance.
Random Forest Classifier Parameters
- n_estimators: Number of trees in the forest.
  - More trees generally improve performance, with diminishing returns, at the cost of computation time.
  - Start with a value of 100 (the default) and increase as needed.
- max_depth: Maximum depth of each tree.
  - Deeper trees can capture more complex patterns, but also risk overfitting.
  - Experiment with values between 5 and 15, and consider lower values for smaller datasets.
- max_features: Number of features considered for splitting at each node.
  - A common value is 'sqrt' (square root of the total number of features).
  - Smaller values decorrelate the trees; larger values give each split more candidate features to choose from.
- criterion: Function used to measure split quality ('gini' or 'entropy').
  - Gini impurity is often slightly faster to compute, but both usually give similar performance.
- min_samples_split: Minimum number of samples required to split a node.
  - Higher values can reduce overfitting, but values that are too high can cause underfitting.
  - Start with 2 (the default) and adjust as needed.
- min_samples_leaf: Minimum number of samples required at a leaf node.
  - Similar to min_samples_split, but applied to leaf nodes.
  - Start with 1 (the default) and adjust as needed.
- bootstrap: Whether to use bootstrap sampling when building trees (True or False).
  - Bootstrapping reduces the variance of the ensemble and tends to improve generalization, but can slightly increase bias.
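The parameters listed above can be searched systematically instead of adjusted by hand. The following is a minimal sketch using `GridSearchCV` with 5-fold cross-validation. The grid values and the seed are illustrative assumptions, not recommendations from the original text.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# A deliberately small grid (assumed values) to keep the search fast
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # seed is an assumption
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On a dataset as small as iris, several parameter combinations are likely to score almost identically, so the reported best combination should not be over-interpreted.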
Advantages of Random Forest Classifier
- The ensemble nature of Random Forests, combining multiple trees, makes them less prone to overfitting compared to individual decision trees.
- Random Forests are effective on datasets with a large number of features and can handle irrelevant variables well.
- Random Forests can provide insights into feature importance, helping in feature selection and understanding the dataset.
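The first advantage can be checked empirically by comparing a single decision tree with a forest under cross-validation. This is a rough sketch with arbitrary seeds. On a dataset as small and clean as iris, the gap between the two may be negligible.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)  # seed is an assumption
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("Single tree:   mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("Random forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```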
Disadvantages of Random Forest Classifier
- Random Forests can be computationally expensive and may require more resources due to the construction of multiple decision trees.
- The ensemble nature makes it challenging to interpret the reasoning behind individual predictions compared to a single decision tree.
- In imbalanced datasets, Random Forests may be biased toward the majority class, impacting the predictive performance for minority classes.
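For the imbalance issue in the last point, one commonly used option in Scikit-learn is the `class_weight="balanced"` setting of `RandomForestClassifier`. The sketch below compares minority-class recall with and without it on a synthetic dataset. The dataset, its 95/5 class ratio and the seeds are all assumptions made for illustration, and whether the setting helps depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data with roughly 95% majority and 5% minority samples
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = RandomForestClassifier(
        n_estimators=100, class_weight=weight, random_state=0
    )
    clf.fit(X_train, y_train)
    # Recall for class 1, the minority class
    recalls[label] = recall_score(y_test, clf.predict(X_test))
    print(f"{label}: minority-class recall = {recalls[label]:.3f}")
```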
Random Forest Classifier using Scikit-learn
In this article, we build a Random Forest Classifier using Python's Scikit-Learn library. To do this, we use the IRIS dataset, a small and widely used benchmark dataset for classification.