Metrics for Classification
The goal of classification tasks is to categorize data points into distinct classes. CatBoost offers several metrics to assess model performance.
1. Accuracy
Accuracy is the percentage of instances that are classified correctly out of all instances. Although it is the most intuitive metric, it may not be the most appropriate one for imbalanced datasets, where one class considerably outnumbers the others.
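To see why accuracy can mislead on imbalanced data, consider a quick synthetic illustration (a toy example, separate from the CatBoost workflow below): a classifier that always predicts the majority class still scores a high accuracy while never detecting the minority class.
Python3
import numpy as np

# Synthetic imbalanced labels: 95% of instances belong to class 0
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy = correctly classified instances / all instances
accuracy = (y_true == y_pred).mean()
print(f'Accuracy: {accuracy:.2f}')  # 0.95, despite never detecting class 1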
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier with 'MultiClass' loss function
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost metrics
metrics = model.eval_metrics(test_pool, metrics=['Accuracy'], plot=True)

# Print the evaluation metrics
accuracy = metrics['Accuracy'][-1]
print(f'Accuracy: {accuracy:.2f}')
Output:
Accuracy: 1.00
Since the Iris dataset poses a classification problem, accuracy is a suitable metric for evaluation.
Here, the Iris dataset from scikit-learn is loaded using the load_iris() function, and the data is split into train and test sets with train_test_split(). A CatBoostClassifier is created with the 'MultiClass' loss function, since Iris is a multiclass classification problem. Pool objects are created for the train and test sets, the model is trained on train_pool with fit(), and it is then evaluated for accuracy on test_pool using CatBoost's eval_metrics() function.
The output shows that the model correctly predicted every instance in the test set, i.e., it fits this small, well-separated dataset perfectly.
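To double-check the value reported by eval_metrics(), one option (a sketch that assumes the model and test split from the snippet above are still in scope) is to compare it against scikit-learn's accuracy_score:
Python3
from sklearn.metrics import accuracy_score

# CatBoost's MultiClass predict() returns a column vector,
# so flatten and cast it before comparing with the true labels
y_pred = model.predict(X_test).flatten().astype(int)

# Should match the 'Accuracy' value reported by eval_metrics()
print(f'sklearn accuracy: {accuracy_score(y_test, y_pred):.2f}')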
2. Multiclass Log Loss
Multiclass Log Loss, also known as cross-entropy for multiclass classification, is a variation of Log Loss designed for multiclass classification problems. The model predicts a probability distribution over multiple classes, and the metric measures how well these predicted probabilities match the true class labels.
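For intuition, the metric averages the negative log of the probability assigned to the true class. Here is a minimal NumPy sketch of the formula using hypothetical probabilities (not CatBoost's internal implementation):
Python3
import numpy as np

# Hypothetical predicted probability distributions over 3 classes
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.80, 0.10],
                  [0.2, 0.20, 0.60]])
true_labels = np.array([0, 1, 2])

# Multiclass log loss: mean of -log(probability of the true class)
loss = -np.mean(np.log(probs[np.arange(len(true_labels)), true_labels]))
print(f'Multiclass log loss: {loss:.4f}')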
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier with 'MultiClass' loss function
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using appropriate metrics for multi-class classification
metrics = model.eval_metrics(test_pool, metrics=['MultiClass'], plot=True)

# Print the evaluation metrics
multi_class_loss = metrics['MultiClass'][-1]
print(f'Multi-Class Loss: {multi_class_loss:.2f}')
Output:
Multi-Class Loss: 0.03
A multi-class loss value of 0.03 suggests that the model is performing well in terms of multi-class classification on the test dataset.
3. Binary Log Loss
Log Loss (cross-entropy loss) quantifies the dissimilarity between the predicted probabilities and the true labels; lower values indicate better performance. This metric is particularly useful when well-calibrated probability estimates are needed, as in applications like fraud detection or medical diagnosis, where calibration is crucial. It is most often used in the context of binary classification, i.e., when only two classes are present in the dataset.
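The binary form of the metric is -(y*log(p) + (1-y)*log(1-p)), averaged over all instances. A minimal NumPy sketch with hypothetical probabilities:
Python3
import numpy as np

# Hypothetical predicted probabilities for the positive class
p = np.array([0.95, 0.10, 0.80, 0.30])
y = np.array([1, 0, 1, 0])  # true binary labels

# Binary log loss, averaged over instances
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f'Log loss: {loss:.4f}')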
The Iris dataset has three classes, so it is not appropriate for this metric. Instead, let's use the Breast Cancer dataset, which has only two classes, i.e., the presence or absence of breast cancer.
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's log loss
metrics = model.eval_metrics(test_pool, metrics=['Logloss'], plot=False)

# Print the evaluation metrics
logloss = metrics['Logloss'][-1]
print(f'Log Loss (Cross-Entropy): {logloss:.2f}')
Output:
Log Loss (Cross-Entropy): 0.08
It quantifies how well the model's predicted probabilities match the true class labels on the validation set. A log loss of 0.08 indicates close alignment between predictions and actual labels.
4. AUC-ROC and AUC-PR
Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and Area Under the Precision-Recall Curve (AUC-PR) are very important for binary classification models. AUC-ROC measures the model's ability to distinguish between positive and negative classes, while AUC-PR emphasizes the trade-off between precision and recall.
Python3
import catboost
from catboost import CatBoostClassifier, Pool
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset as an example (binary classification problem)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Convert to binary classification by mapping
# class 2 to class 1 (positive class)
y_binary = (y == 2).astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42)

# Create a CatBoost classifier with AUC-ROC metric
model = CatBoostClassifier(iterations=500, random_seed=42, eval_metric='AUC')

# Convert the training data into a CatBoost Pool
train_pool = Pool(X_train, label=y_train)

# Train the model
model.fit(train_pool, verbose=100)

# We can also obtain the AUC-ROC and AUC-PR values on the
# validation set (or testing set) after training
validation_pool = Pool(X_test, label=y_test)
eval_result = model.eval_metrics(validation_pool, ['AUC'])['AUC']
metrics = model.eval_metrics(validation_pool, metrics=['PRAUC'], plot=True)
auc_pr = metrics['PRAUC'][-1]

# Print the evaluation metrics
print(f'AUC-PR: {auc_pr:.2f}')
print(f"AUC-ROC: {eval_result[-1]:.4f}")
Output:
Learning rate set to 0.007867
0: total: 789us remaining: 394ms
100: total: 154ms remaining: 608ms
200: total: 308ms remaining: 458ms
300: total: 505ms remaining: 334ms
400: total: 667ms remaining: 165ms
499: total: 785ms remaining: 0us
AUC-PR: 1.00
AUC-ROC: 1.0000
The model is trained for 500 boosting iterations (iterations=500). CatBoost automatically calculates and monitors the specified evaluation metric ('AUC') during training. eval_metrics() then computes the metric on a separate validation set (or testing set) at each iteration, allowing you to track and report the model's performance. AUC-ROC focuses on the true positive rate vs. the false positive rate, while AUC-PR focuses on precision vs. recall.
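As a sanity check (a sketch that assumes the binary model and test split above are still in scope), the same quantities can be reproduced with scikit-learn. Note that average_precision_score is a step-wise approximation of the area under the precision-recall curve, so it may differ slightly from CatBoost's PRAUC:
Python3
from sklearn.metrics import roc_auc_score, average_precision_score

# Predicted probability of the positive class
proba = model.predict_proba(X_test)[:, 1]

print(f'sklearn AUC-ROC: {roc_auc_score(y_test, proba):.4f}')
print(f'sklearn AUC-PR (approx.): {average_precision_score(y_test, proba):.4f}')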
5. F1 Score
The F1 Score is the harmonic mean of the model's precision (how many of its positive predictions are correct) and recall (how many of the actual positives it identifies). This metric is ideal for balancing the trade-off between false positives and false negatives; higher F1 Scores indicate better models.
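The formula is F1 = 2 * (precision * recall) / (precision + recall). A tiny sketch (with made-up precision and recall values) shows how the harmonic mean penalizes an imbalance between the two:
Python3
# F1 is the harmonic mean of precision and recall
def f1_from(precision, recall):
    return 2 * precision * recall / (precision + recall)

# High precision but low recall still drags the F1 score down
print(f'F1: {f1_from(0.95, 0.50):.2f}')  # ~0.66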
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's F1 score
metrics = model.eval_metrics(test_pool, metrics=['F1'], plot=True)

# Print the evaluation metrics
f1 = metrics['F1'][-1]
print(f'F1 Score: {f1:.2f}')
Output:
F1 Score: 0.98
The F1 Score combines precision and recall into a single number. The model achieves an F1 score of 0.98, indicating that it performs very well on this dataset.
6. Precision
Precision measures the model's ability to make correct positive predictions. It is the ratio of true positive predictions to all positive predictions made by the model.
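In terms of confusion-matrix counts, precision = TP / (TP + FP). A one-line illustration with hypothetical counts:
Python3
# Precision = true positives / (true positives + false positives)
tp, fp = 97, 3  # hypothetical confusion-matrix counts
precision = tp / (tp + fp)
print(f'Precision: {precision:.2f}')  # 0.97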
Python3
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1,
                           depth=6, verbose=0)

# Create a Pool object for the training and testing data
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model using CatBoost's precision
metrics = model.eval_metrics(test_pool, metrics=['Precision'], plot=True)

# Print the precision metric
precision = metrics['Precision'][-1]
print(f'Precision: {precision:.2f}')
Output:
Precision: 0.97
This means that 97% of the positive predictions made by the model are actually positive.
There are many other classification metrics supported by CatBoost for both binary and multiclass classification (a combined eval_metrics() call is sketched after the list below). They include:
- Recall (‘Recall’ or ‘TruePositiveRate’): It is the ratio of true positive predictions to all positive instances. It measures the ability of the model to correctly identify positive instances.
- Weighted Metrics (‘WeightedF1’, ‘WeightedPrecision’, ‘WeightedRecall’, ‘WeightedSpecificity’): These metrics are similar to F1, Precision, Recall, and Specificity, respectively, but can be weighted based on class importance, making them suitable for class-imbalanced problems.
- Kappa Score (‘Kappa’): The Kappa Score measures the agreement between predicted and actual classes while adjusting for chance. It’s useful for assessing classification models when the class distribution is imbalanced.
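Several of these metrics can be requested in a single eval_metrics() call. The sketch below assumes the trained breast-cancer model and test_pool from the earlier snippets are still in scope:
Python3
# Request several metrics at once; each key maps to a list of
# per-iteration values, so [-1] gives the final model's score
metrics = model.eval_metrics(
    test_pool, metrics=['Recall', 'F1', 'Kappa'], plot=False)

for name, values in metrics.items():
    print(f'{name}: {values[-1]:.4f}')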