Implementation of Cross-Validation for Hyperparameter Tuning in CatBoost
Installing required module
First, we need to install the CatBoost module in our runtime.
!pip install catboost
Importing required libraries
Now we will import the required Python libraries: pandas, Matplotlib, Seaborn, scikit-learn, tabulate, and CatBoost.
Python3
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
Loading Dataset
Now we will load the Titanic dataset, select the relevant features for this implementation, and create a list of categorical features that will be fed to the model later on.
Python3
# Load the Titanic dataset
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
df = pd.read_csv(url)

# Select relevant features and target variable
X = df[['Pclass', 'Sex', 'Age', 'Fare']]
y = df['Survived']

# Define the categorical features of the dataset
cat_features = ['Pclass', 'Sex']
This code loads the Titanic dataset from the specified URL into a DataFrame named df. It then selects the features "Pclass", "Sex", "Age", and "Fare" and assigns them to X, while y holds the target variable "Survived". Finally, it lists "Pclass" and "Sex" as categorical features in cat_features. CatBoost handles categorical features natively, so these columns do not need to be encoded by hand.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a critical first step in data analysis that summarizes the main characteristics of a dataset, often using visual methods. It involves uncovering patterns, understanding underlying structure, identifying anomalies, and testing assumptions about the data.
Correlation Matrix
Visualizing the correlation matrix helps us understand how the numeric features of the dataset are correlated with each other.
Python3
# Visualize correlation matrix
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
Output: [Figure: heatmap titled "Correlation Matrix"]
This code uses the corr method to compute the correlation matrix of the numeric columns of the Titanic dataset; with numeric_only=True, text columns such as "Sex" are left out. A heatmap of the correlations is then drawn with Seaborn. The annot=True parameter prints each correlation value in its cell, and the "coolwarm" color map distinguishes negative from positive correlations.
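To make the numbers in the heatmap more concrete, here is a minimal sketch of how a single Pearson correlation coefficient can be computed by hand. The helper name pearson and the toy values are assumptions made for illustration; they are not rows from the Titanic data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values (hypothetical, not taken from the dataset)
pclass = [1, 1, 2, 3, 3, 3]
fare = [80.0, 60.0, 25.0, 8.0, 7.5, 9.0]

r = pearson(pclass, fare)
print(round(r, 3))  # negative: higher class numbers go with lower fares here
```

Each cell of a correlation matrix holds one such coefficient for a pair of columns, always between -1 and 1.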
Cross-Validation Settings
Python3
# Define a range of hyperparameter values to search through
iterations_values = [100, 200, 300]
depth_values = [6, 8, 10]
learning_rate_values = [0.1, 0.05, 0.01]

best_score = 0      # Initialize the best score
best_params = {}    # Initialize the best hyperparameters

# Define cross-validation settings
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
Here we initialize stratified K-fold cross-validation. We will use it to tune three hyperparameters of the CatBoost model, which are discussed below:
- iterations: The number of boosting iterations, which corresponds to the number of decision trees to be built. It controls the complexity of the model.
- depth: The depth of each decision tree in the ensemble. A higher depth allows the model to capture more complex relationships in the data, but it may lead to overfitting if it is set too high.
- learning_rate: The learning rate scales the contribution of each new tree, that is, the step size of each boosting update. A smaller learning rate can help prevent overfitting but may require more iterations to converge.
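The interplay between learning_rate and iterations can be illustrated with a deliberately simplified sketch. It assumes squared loss and a weak learner that is just a constant (the mean of the current residuals), which is far simpler than CatBoost's trees; the function name boost_constant and the target values are hypothetical.

```python
def boost_constant(targets, iterations, learning_rate):
    """Toy boosting with squared loss where every weak learner is a constant.

    Each round fits the mean of the current residuals and adds a
    learning_rate-sized fraction of it to the running prediction.
    """
    prediction = 0.0
    for _ in range(iterations):
        residual_mean = sum(t - prediction for t in targets) / len(targets)
        prediction += learning_rate * residual_mean
    return prediction


targets = [8.0, 10.0, 12.0]  # made-up values whose mean is 10

for lr in (0.1, 0.05, 0.01):
    for n in (100, 300):
        value = boost_constant(targets, n, lr)
        print(f"learning_rate={lr}, iterations={n}: prediction={value:.3f}")
```

In this toy setting the prediction approaches the target mean of 10 quickly with a learning rate of 0.1, while a learning rate of 0.01 is still well short of it after 100 rounds and needs more iterations to catch up.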
This code defines the search space for the model: iterations_values with values [100, 200, 300], depth_values with values [6, 8, 10], and learning_rate_values with values [0.1, 0.05, 0.01], which gives 27 combinations in total. To keep track of the best combination found during the search, the code initializes best_score to 0 and best_params as an empty dictionary.
It also configures StratifiedKFold with 5 splits, shuffling, and a random state of 42. Stratification keeps the proportion of survivors roughly the same in every fold, and the fixed random state makes the splits reproducible. These folds are used during tuning to assess how well the model performs with each combination of hyperparameters.
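The idea behind stratified splitting can be sketched in a few lines of plain Python. This is an illustration of the principle only, not scikit-learn's actual algorithm; the function stratified_folds and the label counts are made up for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed):
    """Assign each sample index to a fold so every fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Made-up labels: 60 negatives and 40 positives
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels, n_splits=5, seed=42)

for number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {number}: size={len(fold)}, positives={positives}")
```

Every fold ends up with 20 samples, 8 of them positive, so each validation set has the same 40% positive rate as the full label list.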
Tuning Loop
Python3
# Initialize a list to store tuning progress
tuning_progress = []

# Perform hyperparameter tuning with cross-validation
for iterations in iterations_values:
    for depth in depth_values:
        for learning_rate in learning_rate_values:
            # Create a CatBoost model with the current hyperparameters
            model = CatBoostClassifier(iterations=iterations, depth=depth,
                                       learning_rate=learning_rate,
                                       cat_features=cat_features, verbose=False)

            # Perform cross-validation and get the mean F1 score
            f1_scores = []
            for train_index, val_index in cv.split(X, y):
                X_train, X_val = X.iloc[train_index], X.iloc[val_index]
                y_train, y_val = y.iloc[train_index], y.iloc[val_index]

                model.fit(X_train, y_train)
                y_pred = model.predict(X_val)
                f1 = f1_score(y_val, y_pred)
                f1_scores.append(f1)

            mean_f1 = sum(f1_scores) / len(f1_scores)

            # Update the best hyperparameters if a better score is found
            if mean_f1 > best_score:
                best_score = mean_f1
                best_params = {'iterations': iterations, 'depth': depth,
                               'learning_rate': learning_rate}

            # Append the progress to the list
            tuning_progress.append({'Iterations': iterations, 'Depth': depth,
                                    'Learning Rate': learning_rate,
                                    'F1 Score': mean_f1})
This code uses cross-validation to search the hyperparameter combinations for a CatBoostClassifier. Here is a step-by-step description:
- tuning_progress is first created as an empty list. It records the progress of the search, including the mean F1 score for each combination of hyperparameters.
- Three nested loops iterate through the search space: the outer loop runs iterations over [100, 200, 300], the middle loop runs depth over [6, 8, 10], and the inner loop runs learning_rate over [0.1, 0.05, 0.01].
- Inside the nested loops, a CatBoostClassifier is created with the current number of iterations, tree depth, and learning rate, together with the categorical features defined in cat_features. Setting verbose=False suppresses the training log.
- The model is evaluated with cross-validation using the F1 score. The StratifiedKFold object defined earlier divides the data into training and validation sets. For each fold, the model is trained on the training data (X_train and y_train) and then used to predict the validation data (X_val). The F1 score of every fold is stored in the f1_scores list.
- The mean F1 score is calculated by averaging the F1 values from all folds. This shows how well the model performs with the current hyperparameters.
- If the mean F1 score is higher than the best_score observed so far, best_score and best_params are updated with the new score and hyperparameters.
- Finally, the current hyperparameters (iterations, depth, and learning rate) and the corresponding mean F1 score are appended to tuning_progress.
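The three nested loops described above can also be written as a single loop with itertools.product. The sketch below shows only the search skeleton: fold_scores is a hypothetical stand-in that returns made-up numbers instead of training CatBoost models, so the "best" result it finds has no meaning beyond demonstrating the control flow.

```python
from itertools import product
from statistics import mean

iterations_values = [100, 200, 300]
depth_values = [6, 8, 10]
learning_rate_values = [0.1, 0.05, 0.01]


def fold_scores(params, n_folds=5):
    """Hypothetical stand-in for training and scoring one model per fold."""
    fake_score = params["learning_rate"] + params["depth"] / 100
    return [fake_score for _ in range(n_folds)]


best_score, best_params = float("-inf"), None
for iterations, depth, learning_rate in product(
    iterations_values, depth_values, learning_rate_values
):
    params = {"iterations": iterations, "depth": depth,
              "learning_rate": learning_rate}
    score = mean(fold_scores(params))
    # Strict comparison keeps the first combination seen when scores tie
    if score > best_score:
        best_score, best_params = score, params

n_combinations = (len(iterations_values) * len(depth_values)
                  * len(learning_rate_values))
print(n_combinations, "combinations,", n_combinations * 5, "model fits")
print(best_params)
```

The count is worth noting: 27 combinations with 5 folds each means 135 model fits, which is why grid searches become expensive as more hyperparameters are added. Starting best_score at negative infinity rather than 0 is a small design choice that also works for metrics that can be negative.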
Visualization of Tuning Progress
Now we will print the tuning progress as a table and extract the best set of hyperparameter values.
Python3
# Print the tuning progress in a table
print(tabulate(tuning_progress, headers='keys', tablefmt='pretty'))

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)
Output:
+------------+-------+---------------+--------------------+
| Iterations | Depth | Learning Rate | F1 Score |
+------------+-------+---------------+--------------------+
| 100 | 6 | 0.1 | 0.7322687453324533 |
| 100 | 6 | 0.05 | 0.7182202485927607 |
| 100 | 6 | 0.01 | 0.7158252814552029 |
| 100 | 8 | 0.1 | 0.740413070492519 |
| 100 | 8 | 0.05 | 0.7273177220983926 |
| 100 | 8 | 0.01 | 0.7130408178567857 |
| 100 | 10 | 0.1 | 0.7421390513453284 |
| 100 | 10 | 0.05 | 0.7227720134780492 |
| 100 | 10 | 0.01 | 0.714975371850372 |
| 200 | 6 | 0.1 | 0.7691377296011834 |
| 200 | 6 | 0.05 | 0.7455641270757373 |
| 200 | 6 | 0.01 | 0.7152601973003904 |
| 200 | 8 | 0.1 | 0.7721211161834263 |
| 200 | 8 | 0.05 | 0.7562464661771585 |
| 200 | 8 | 0.01 | 0.726428330128534 |
| 200 | 10 | 0.1 | 0.782297131444335 |
| 200 | 10 | 0.05 | 0.7702156025135478 |
| 200 | 10 | 0.01 | 0.723850994293308 |
| 300 | 6 | 0.1 | 0.7669845192385327 |
| 300 | 6 | 0.05 | 0.7658800713486457 |
| 300 | 6 | 0.01 | 0.721959083713663 |
| 300 | 8 | 0.1 | 0.7743234942561392 |
| 300 | 8 | 0.05 | 0.7687311053890516 |
| 300 | 8 | 0.01 | 0.7276006304501964 |
| 300 | 10 | 0.1 | 0.7742531262139627 |
| 300 | 10 | 0.05 | 0.7655710411482259 |
| 300 | 10 | 0.01 | 0.7364337989615153 |
+------------+-------+---------------+--------------------+
Best Hyperparameters: {'iterations': 200, 'depth': 10, 'learning_rate': 0.1}
This code first uses the tabulate function to show the tuning progress as a table. Each row lists one combination of hyperparameters together with its mean cross-validated F1 score.
The best hyperparameters (best_params) are then printed. They represent the CatBoost configuration that achieved the highest mean F1 score during tuning, about 0.782 for 200 iterations, depth 10, and learning rate 0.1; the score itself is kept in best_score. The table also shows that a learning rate of 0.01 gives the lowest F1 score at every setting of iterations and depth, which suggests that 300 iterations are too few for such a small learning rate.
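Because tuning_progress is a list of dictionaries, the best rows can also be recovered from it after the search. The sketch below assumes only a handful of rows, copied from the table above with the F1 scores rounded to four decimals.

```python
# Four rows copied from the tuning table (F1 rounded to four decimals)
tuning_progress = [
    {"Iterations": 100, "Depth": 10, "Learning Rate": 0.1, "F1 Score": 0.7421},
    {"Iterations": 200, "Depth": 8, "Learning Rate": 0.1, "F1 Score": 0.7721},
    {"Iterations": 200, "Depth": 10, "Learning Rate": 0.1, "F1 Score": 0.7823},
    {"Iterations": 300, "Depth": 8, "Learning Rate": 0.1, "F1 Score": 0.7743},
]

# Sort from highest to lowest mean F1 score
ranked = sorted(tuning_progress, key=lambda row: row["F1 Score"], reverse=True)
best_row = ranked[0]
print(best_row)

# Show the top three candidates, which is often more informative than one winner
for row in ranked[:3]:
    print(row["Iterations"], row["Depth"], row["Learning Rate"], row["F1 Score"])
```

Looking at the top few rows rather than a single winner helps judge whether the best configuration is clearly ahead or only marginally better than its neighbours.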
Evaluation of the best model
We have already extracted the best parameters. Now we will feed them to the model and check its performance.
Python3
# Train the model on best parameters
best_model = CatBoostClassifier(**best_params, cat_features=cat_features,
                                verbose=False)
best_model.fit(X, y)

# Make predictions
y_pred = best_model.predict(X)

# Calculate accuracy and F1 score for best model
accuracy = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred)
print("Accuracy:", accuracy)
print("F1 Score:", f1)
Output:
Accuracy: 0.9267192784667418
F1 Score: 0.8995363214837713
This code builds the final CatBoost model from the hyperparameters (best_params) found during tuning. The model is created with the specified categorical features and verbose=False so that training runs silently. It is then fitted to the complete dataset and used to predict on that same dataset. Because the model is scored on the rows it was trained on, these figures are training scores and are higher than the cross-validated F1 score of about 0.782 obtained during tuning, which is the better estimate of performance on unseen data.
The accuracy and F1 score of this model are then computed and printed. Accuracy measures the proportion of correctly predicted instances, while the F1 score is the harmonic mean of precision and recall and therefore balances the two.
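To show how the two metrics are derived, here is a small worked sketch that computes accuracy, precision, recall, and F1 by hand. The function name and the ten labels are made up for illustration; the formulas follow the standard binary definitions for the positive class.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here accuracy is 0.7 while F1 is only about 0.571, because the model misses half of the positives. This gap is why F1 is a useful tuning metric when the positive class is the minority, as survivors are in the Titanic data.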
Conclusion
We can conclude that hyperparameter tuning is an important step toward getting good performance from a model. Here, after tuning, our model achieved 92.67% accuracy and an F1 score of 89.95% on the data it was trained on, while the cross-validated F1 score of the selected configuration was about 78%. There is still room for improvement, and tuning additional hyperparameters is one way to pursue it. On large real-world datasets, searching over a broader set of hyperparameters is often needed to find a good configuration.
CatBoost Cross-Validation and Hyperparameter Tuning
CatBoost is a powerful gradient-boosting algorithm that is popular for its ability to handle categorical features effectively in both classification and regression tasks. To get the most out of CatBoost, it is essential to fine-tune its hyperparameters, which can be done with cross-validation. Cross-validation is a technique that allows data scientists and machine learning practitioners to rigorously assess a model's performance under different hyperparameter configurations and select the best one. In this article, we discuss how to tune the hyperparameters of CatBoost using cross-validation.