Metrics for Classification
The goal of classification tasks is to categorize data points into distinct classes. CatBoost offers several metrics to assess model performance.
1. Accuracy
Accuracy is the percentage of instances that are classified correctly out of all instances. Although it is the most intuitive metric, it may not be the most appropriate one for imbalanced datasets, where one class considerably outnumbers the others.
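To see why accuracy can mislead on imbalanced data, consider a quick synthetic illustration (a toy example, separate from the CatBoost workflow below): a classifier that always predicts the majority class still scores a high accuracy while never detecting the minority class.
Python3
import numpy as np

# Synthetic imbalanced labels: 95% of instances belong to class 0
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy = correctly classified instances / all instances
accuracy = (y_true == y_pred).mean()
print(f'Accuracy: {accuracy:.2f}')  # 0.95, despite never detecting class 1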
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier with 'MultiClass' loss function
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost metrics
metrics = model.eval_metrics(test_pool, metrics=['Accuracy'], plot=True)

# Print the evaluation metrics
accuracy = metrics['Accuracy'][-1]
print(f'Accuracy: {accuracy:.2f}')
Output:
Accuracy: 1.00
Since the Iris dataset poses a classification problem, accuracy is a suitable metric for evaluation.
Here, the Iris dataset from scikit-learn is loaded using the load_iris() function, and the data is split into train and test sets with train_test_split(). A CatBoostClassifier is created with the 'MultiClass' loss function, since Iris is a multiclass classification problem. Pool objects are created for the train and test sets, the model is trained on train_pool with fit(), and it is then evaluated for accuracy on test_pool using CatBoost's eval_metrics() function.
The output shows that the model correctly predicted every instance in the test set, i.e., it fits this small, well-separated dataset perfectly.
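To double-check the value reported by eval_metrics(), one option (a sketch that assumes the model and test split from the snippet above are still in scope) is to compare it against scikit-learn's accuracy_score:
Python3
from sklearn.metrics import accuracy_score

# CatBoost's MultiClass predict() returns a column vector,
# so flatten and cast it before comparing with the true labels
y_pred = model.predict(X_test).flatten().astype(int)

# Should match the 'Accuracy' value reported by eval_metrics()
print(f'sklearn accuracy: {accuracy_score(y_test, y_pred):.2f}')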
2. Multiclass Log Loss
Multiclass Log Loss, also known as cross-entropy for multiclass classification, is a variation of Log Loss designed for multiclass classification problems. The model predicts a probability distribution over multiple classes, and the metric measures how well these predicted probabilities match the true class labels.
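For intuition, the metric averages the negative log of the probability assigned to the true class. Here is a minimal NumPy sketch of the formula using hypothetical probabilities (not CatBoost's internal implementation):
Python3
import numpy as np

# Hypothetical predicted probability distributions over 3 classes
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.80, 0.10],
                  [0.2, 0.20, 0.60]])
true_labels = np.array([0, 1, 2])

# Multiclass log loss: mean of -log(probability of the true class)
loss = -np.mean(np.log(probs[np.arange(len(true_labels)), true_labels]))
print(f'Multiclass log loss: {loss:.4f}')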
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier with 'MultiClass' loss function
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using appropriate metrics for multi-class classification
metrics = model.eval_metrics(test_pool, metrics=['MultiClass'], plot=True)

# Print the evaluation metrics
multi_class_loss = metrics['MultiClass'][-1]
print(f'Multi-Class Loss: {multi_class_loss:.2f}')
Output:
Multi-Class Loss: 0.03
A multi-class loss value of 0.03 suggests that the model is performing well in terms of multi-class classification on the test dataset.
3. Binary Log Loss
Log Loss (cross-entropy loss) quantifies the dissimilarity between the predicted probabilities and the true labels; lower values indicate better performance. This metric is particularly useful when well-calibrated probability estimates are needed, as in applications like fraud detection or medical diagnosis, where calibration is crucial. It is most often used in the context of binary classification, i.e., when only two classes are present in the dataset.
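The binary form of the metric is -(y*log(p) + (1-y)*log(1-p)), averaged over all instances. A minimal NumPy sketch with hypothetical probabilities:
Python3
import numpy as np

# Hypothetical predicted probabilities for the positive class
p = np.array([0.95, 0.10, 0.80, 0.30])
y = np.array([1, 0, 1, 0])  # true binary labels

# Binary log loss, averaged over instances
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f'Log loss: {loss:.4f}')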
The Iris dataset has three classes, so it is not appropriate for this metric. Instead, let's use the Breast Cancer dataset, which has only two classes, i.e., the presence or absence of breast cancer.
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's log loss
metrics = model.eval_metrics(test_pool, metrics=['Logloss'], plot=False)

# Print the evaluation metrics
logloss = metrics['Logloss'][-1]
print(f'Log Loss (Cross-Entropy): {logloss:.2f}')
Output:
Log Loss (Cross-Entropy): 0.08
It quantifies how well the model's predicted probabilities match the true class labels on the validation set. A log loss of 0.08 indicates close alignment between predictions and actual labels.
4. AUC-ROC and AUC-PR
Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and Area Under the Precision-Recall Curve (AUC-PR) are very important for binary classification models. AUC-ROC measures the model's ability to distinguish between positive and negative classes, while AUC-PR emphasizes the trade-off between precision and recall.
Python3
import catboost
from catboost import CatBoostClassifier, Pool
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset as an example (binary classification problem)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Convert to binary classification by mapping
# class 2 to class 1 (positive class)
y_binary = (y == 2).astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42)

# Create a CatBoost classifier with AUC-ROC metric
model = CatBoostClassifier(iterations=500, random_seed=42, eval_metric='AUC')

# Convert the training data into a CatBoost Pool
train_pool = Pool(X_train, label=y_train)

# Train the model
model.fit(train_pool, verbose=100)

# We can also obtain the AUC-ROC and AUC-PR values on the
# validation set (or testing set) after training
validation_pool = Pool(X_test, label=y_test)
eval_result = model.eval_metrics(validation_pool, ['AUC'])['AUC']
metrics = model.eval_metrics(validation_pool, metrics=['PRAUC'], plot=True)
auc_pr = metrics['PRAUC'][-1]

# Print the evaluation metrics
print(f'AUC-PR: {auc_pr:.2f}')
print(f"AUC-ROC: {eval_result[-1]:.4f}")
Output:
Learning rate set to 0.007867
0: total: 789us remaining: 394ms
100: total: 154ms remaining: 608ms
200: total: 308ms remaining: 458ms
300: total: 505ms remaining: 334ms
400: total: 667ms remaining: 165ms
499: total: 785ms remaining: 0us
AUC-PR: 1.00
AUC-ROC: 1.0000
The model is trained for 500 boosting iterations (iterations=500). CatBoost automatically calculates and monitors the specified evaluation metric ('AUC') during training. eval_metrics() then computes the metric on a separate validation set (or testing set) at each iteration, allowing you to track and report the model's performance. AUC-ROC focuses on the true positive rate vs. the false positive rate, while AUC-PR focuses on precision vs. recall.
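As a sanity check (a sketch that assumes the binary model and test split above are still in scope), the same quantities can be reproduced with scikit-learn. Note that average_precision_score is a step-wise approximation of the area under the precision-recall curve, so it may differ slightly from CatBoost's PRAUC:
Python3
from sklearn.metrics import roc_auc_score, average_precision_score

# Predicted probability of the positive class
proba = model.predict_proba(X_test)[:, 1]

print(f'sklearn AUC-ROC: {roc_auc_score(y_test, proba):.4f}')
print(f'sklearn AUC-PR (approx.): {average_precision_score(y_test, proba):.4f}')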
5. F1 Score
The F1 Score is the harmonic mean of the model's precision (how many of its positive predictions are correct) and recall (how many of the actual positives it identifies). This metric is ideal for balancing the trade-off between false positives and false negatives; higher F1 Scores indicate better models.
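The formula is F1 = 2 * (precision * recall) / (precision + recall). A tiny sketch (with made-up precision and recall values) shows how the harmonic mean penalizes an imbalance between the two:
Python3
# F1 is the harmonic mean of precision and recall
def f1_from(precision, recall):
    return 2 * precision * recall / (precision + recall)

# High precision but low recall still drags the F1 score down
print(f'F1: {f1_from(0.95, 0.50):.2f}')  # ~0.66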
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's F1 score
metrics = model.eval_metrics(test_pool, metrics=['F1'], plot=True)

# Print the evaluation metrics
f1 = metrics['F1'][-1]
print(f'F1 Score: {f1:.2f}')
Output:
F1 Score: 0.98
The F1 Score combines precision and recall into a single number. The model achieves an F1 score of 0.98, indicating that it performs very well on this dataset.
6. Precision
Precision measures the model's ability to make correct positive predictions. It is the ratio of true positive predictions to all positive predictions made by the model.
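In terms of confusion-matrix counts, precision = TP / (TP + FP). A one-line illustration with hypothetical counts:
Python3
# Precision = true positives / (true positives + false positives)
tp, fp = 97, 3  # hypothetical confusion-matrix counts
precision = tp / (tp + fp)
print(f'Precision: {precision:.2f}')  # 0.97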
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's precision
metrics = model.eval_metrics(test_pool, metrics=['Precision'], plot=True)

# Print the precision metric
precision = metrics['Precision'][-1]
print(f'Precision: {precision:.2f}')
Output:
Precision: 0.97
This means that 97% of the positive predictions made by the model are actually positive.
There are many other classification metrics supported by CatBoost for both binary and multiclass classification (a combined eval_metrics() call is sketched after the list below). They include:
- Recall (‘Recall’ or ‘TruePositiveRate’): It is the ratio of true positive predictions to all positive instances. It measures the ability of the model to correctly identify positive instances.
- Weighted Metrics (‘WeightedF1’, ‘WeightedPrecision’, ‘WeightedRecall’, ‘WeightedSpecificity’): These metrics are similar to F1, Precision, Recall, and Specificity, respectively, but can be weighted based on class importance, making them suitable for class-imbalanced problems.
- Kappa Score (‘Kappa’): The Kappa Score measures the agreement between predicted and actual classes while adjusting for chance. It’s useful for assessing classification models when the class distribution is imbalanced.
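Several of these metrics can be requested in a single eval_metrics() call. The sketch below assumes the trained breast-cancer model and test_pool from the earlier snippets are still in scope:
Python3
# Request several metrics at once; each key maps to a list of
# per-iteration values, so [-1] gives the final model's score
metrics = model.eval_metrics(
    test_pool, metrics=['Recall', 'F1', 'Kappa'], plot=False)

for name, values in metrics.items():
    print(f'{name}: {values[-1]:.4f}')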