Implementing CatBoostClassifier with Various Parameters on the Iris Dataset
Import the Required Libraries
Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, classification_report
from catboost import CatBoostClassifier, Pool
Here we import NumPy, pandas, scikit-learn's dataset loader, splitting utility and classification metrics, and the two CatBoost classes used below.
CatBoostClassifier: The gradient boosting estimator for classification tasks in the CatBoost library. The name CatBoost is short for "categorical boosting". The library is known for strong performance and ease of use, and it works especially well with categorical features.
Pool: The data structure CatBoost uses to hold data for training and evaluation. It supports custom feature names and categorical features, and it is designed to work with large datasets.
Load the Iris Dataset and Split it into Training and Testing Datasets
Python
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Convert the target to a binary label: original class 0 becomes 1, all other classes become 0
y = (y == 0).astype(int)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This code first loads the Iris dataset, which consists of features (X) and target labels (y). It then converts the target to a binary label: class 0 is encoded as 1 and the other two classes as 0. Finally, the dataset is split into training (80%) and testing (20%) sets so the model can be evaluated on held-out data.
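The one-vs-rest conversion described above can be checked on a small array. This is a minimal sketch on made-up labels (the array `y_multi` is hypothetical and merely stands in for `iris.target`); it also counts the samples per binary class, which is a quick way to see how imbalanced the new target is.

```python
import numpy as np

# Hypothetical multiclass labels standing in for iris.target (values 0, 1, 2)
y_multi = np.array([0, 1, 2, 0, 2, 1, 0, 2, 1])

# One-vs-rest conversion: original class 0 becomes 1, everything else becomes 0
y_binary = (y_multi == 0).astype(int)

# Count samples per binary class: index 0 is "other", index 1 is "original class 0"
counts = np.bincount(y_binary, minlength=2)

print(y_binary)
print(counts)
```

With three equally sized classes, as in Iris, the positive class ends up holding roughly one third of the samples.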
Create CatBoost Pools for efficient Data Handling
Python
# Create CatBoost Pools for efficient data handling
train_pool = Pool(data=X_train, label=y_train, cat_features=[], feature_names=iris.feature_names)
test_pool = Pool(data=X_test, label=y_test, cat_features=[], feature_names=iris.feature_names)
These lines wrap the training and testing data (X_train and X_test) in CatBoost Pools, the format CatBoost uses for training and evaluation. Each Pool also carries the labels (y_train or y_test), the feature names from the Iris dataset, and an empty cat_features list, since all four Iris features are numeric.
Defining CatBoost Parameters
Python
# Define CatBoost parameters
params = {
    'iterations': 100,
    'depth': 6,
    'learning_rate': 0.1,
    'loss_function': 'Logloss',  # Classification task
    'custom_metric': ['Accuracy', 'AUC'],  # Additional metrics to track
    'verbose': 10,  # Print training progress every 10 iterations
    'random_seed': 42  # Set a random seed for reproducibility
}
These lines define the CatBoost classifier's settings: the number of boosting iterations, the depth of each tree, the learning rate, the loss function to optimize (Logloss, for binary classification), and extra metrics to track during training (Accuracy and AUC). The verbose parameter controls how often progress is printed, and random_seed fixes the random seed so that runs are reproducible.
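To make the effect of verbose concrete, here is a small sketch of which iteration indices get a progress line when iterations is 100 and verbose is 10. The helper `logged_iterations` is hypothetical; it mirrors the pattern seen in the training log (every tenth iteration plus the final one) and is not CatBoost's actual logging code.

```python
from typing import List


def logged_iterations(iterations: int, verbose: int) -> List[int]:
    """Indices that get a log line: every `verbose`-th iteration plus the last one."""
    last = iterations - 1
    return [i for i in range(iterations) if i % verbose == 0 or i == last]


# Same values as in the params dictionary
logged = logged_iterations(iterations=100, verbose=10)
print(logged)
```

Iterations are zero-indexed, which is why the final log line is numbered 99 rather than 100.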
Train and Evaluate the CatBoost Model
Python
# Train the CatBoost classifier
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool)

# Make predictions on the test set
y_pred = model.predict(test_pool)
y_proba = model.predict_proba(test_pool)[:, 1]  # Probability of class 1

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
logloss = log_loss(y_test, y_proba)
roc_auc = roc_auc_score(y_test, y_proba)

# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss:.4f}")
print(f"AUC: {roc_auc:.4f}")
Output:
0: learn: 0.6333595 test: 0.6326569 best: 0.6326569 (0) total: 5.68ms remaining: 563ms
10: learn: 0.2973689 test: 0.2938201 best: 0.2938201 (10) total: 9.87ms remaining: 79.9ms
20: learn: 0.1637735 test: 0.1591490 best: 0.1591490 (20) total: 13.7ms remaining: 51.5ms
30: learn: 0.1051307 test: 0.1011177 best: 0.1011177 (30) total: 17.7ms remaining: 39.5ms
40: learn: 0.0715529 test: 0.0695287 best: 0.0695287 (40) total: 21.5ms remaining: 31ms
50: learn: 0.0533052 test: 0.0515575 best: 0.0515575 (50) total: 25.1ms remaining: 24.1ms
60: learn: 0.0416665 test: 0.0404120 best: 0.0404120 (60) total: 28.6ms remaining: 18.3ms
70: learn: 0.0342899 test: 0.0332187 best: 0.0332187 (70) total: 33.8ms remaining: 13.8ms
80: learn: 0.0294652 test: 0.0286255 best: 0.0286255 (80) total: 37.4ms remaining: 8.78ms
90: learn: 0.0256959 test: 0.0250120 best: 0.0250120 (90) total: 41.2ms remaining: 4.07ms
99: learn: 0.0230690 test: 0.0225294 best: 0.0225294 (99) total: 45.1ms remaining: 0us
bestTest = 0.02252943945
bestIteration = 99
Accuracy: 1.0000
Log Loss: 0.0225
AUC: 1.0000
This code trains a CatBoost classifier with the given parameters and evaluates it on the test set. It computes three metrics from the predictions: accuracy (the share of correctly classified samples), log loss (the quality of the predicted probabilities), and area under the ROC curve, or AUC (how well the model ranks positive samples above negative ones). The perfect scores are expected here: the positive class, setosa, is linearly separable from the other two Iris species.
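The three metrics can also be computed by hand, which shows what each one measures. This is a minimal sketch in plain Python on hypothetical labels and probabilities (not values from the Iris run); it assumes a 0.5 decision threshold for accuracy and counts tied scores as half a correct pair for AUC.

```python
import math

# Hypothetical true labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
p_pos = [0.9, 0.2, 0.6, 0.4]

# Accuracy: threshold the probabilities at 0.5 and count matches
y_hat = [int(p >= 0.5) for p in p_pos]
accuracy = sum(t == h for t, h in zip(y_true, y_hat)) / len(y_true)

# Log loss: mean negative log-likelihood of the true labels
log_loss_value = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, p_pos)
) / len(y_true)

# AUC: fraction of (positive, negative) pairs ranked correctly; ties count as half
pos = [p for t, p in zip(y_true, p_pos) if t == 1]
neg = [p for t, p in zip(y_true, p_pos) if t == 0]
auc = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {log_loss_value:.4f}")
print(f"AUC: {auc:.4f}")
```

Note that accuracy and AUC can both be perfect while log loss is still above zero, because log loss also penalizes correct predictions that are not fully confident.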
Classification Report
Python
# Generate a classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        10
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
The classification_report function compares the predicted labels (y_pred) with the actual labels (y_test) and summarizes the model's performance per class. The printed report lists precision, recall, F1-score, and support (the number of true samples) for each class, followed by the overall accuracy and the macro and weighted averages.
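The per-class numbers in such a report come from simple counts. The sketch below derives precision, recall, F1-score, and support for class 1 in plain Python; the label lists are hypothetical and deliberately contain errors so that the metrics are not all 1.00.

```python
# Hypothetical true and predicted labels (not from the Iris run)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Confusion counts for the positive class (label 1)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = sum(t == 1 for t in y_true)  # number of true samples in class 1

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Repeating the same counts with label 0 as the positive class gives the other row of the report.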
CatBoost Tree Parameters
CatBoost is a popular gradient-boosting library known for its effectiveness in machine-learning competitions. It is particularly well-suited for tabular data and has several parameters that can be tuned to improve model performance. In this article, we will focus on CatBoost’s tree-related parameters and explore how they influence the model’s behaviour.