Implementing CatBoost Embedding on Synthetic data
Here, we will generate a synthetic dataset and then apply CatBoost to it.
Step 1: Importing Libraries
First, we need to import the necessary Python libraries. We’ll need CatBoost for the machine learning model, NumPy for data manipulation, and Matplotlib for visualization.
import numpy as np
from catboost import CatBoostClassifier, Pool
import matplotlib.pyplot as plt
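If CatBoost is not installed yet, it can typically be added with pip install catboost. As an optional sanity check (not part of the original walkthrough), we can confirm the libraries import correctly and print their versions:
# Optional sanity check: confirm the libraries import and print their versions
import catboost
import matplotlib
print('NumPy:', np.__version__)
print('CatBoost:', catboost.__version__)
print('Matplotlib:', matplotlib.__version__)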
Step 2: Generating a Synthetic Dataset
Using NumPy, we will generate a fictitious dataset so we can illustrate the procedure without requiring outside data. np.random.rand generates random values for the features and np.random.randint generates binary labels, giving us a dataset of 100 samples with two features and a binary label.
# Set a random seed for reproducibility
np.random.seed(0)
# Generate synthetic features and labels
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, 100)
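As a quick check (an optional step, not in the original walkthrough), we can inspect the shapes and label counts to confirm we really have 100 samples, two features, and binary labels:
# Optional check: shapes and label counts of the generated data
print(X.shape)           # expected: (100, 2)
print(y.shape)           # expected: (100,)
print(np.bincount(y))    # number of samples with label 0 and label 1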
Step 3: Visualizing the Dataset
To understand the structure of our data, it is useful to visualize it before moving forward. We make a scatter plot using Matplotlib's scatter function and color the points according to their labels.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.title('Synthetic Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
Step 4: Preparing the Data for CatBoost
CatBoost works with data in its Pool format, a data structure that efficiently handles both numerical and categorical information. We pass our features (X) and labels (y) to the Pool constructor, and our data is then prepared correctly for CatBoost training.
# Create a Pool object
train_pool = Pool(data=X, label=y)
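Our synthetic features are purely numerical, but the same Pool constructor is where categorical columns would be declared. A minimal sketch, using a small hypothetical DataFrame (not part of this tutorial's data) with a categorical column declared via cat_features:
import pandas as pd
# Hypothetical example: two numeric columns plus one categorical column
df_cat = pd.DataFrame({
    'feature_1': [0.1, 0.3, 0.7],
    'feature_2': [0.5, 0.2, 0.9],
    'color': ['red', 'blue', 'red']
})
y_cat = [0, 1, 0]
# Declare the categorical column by name so CatBoost encodes it internally
cat_pool = Pool(data=df_cat, label=y_cat, cat_features=['color'])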
Step 5: Training the CatBoost Model
We will now use the artificial dataset to define and train our CatBoost classifier.
# Initialize the CatBoost classifier
model = CatBoostClassifier(iterations=100, depth=2, learning_rate=1, loss_function='Logloss')
# Train the model
model.fit(train_pool, verbose=False)
Output:
<catboost.core.CatBoostClassifier at 0x7ca3a84ac040>
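After fitting, the model object exposes some useful introspection. For example, we can check how many trees were built and how important each feature is (a quick sketch; the exact importance values will vary from run to run):
# Inspect the trained model
print(model.tree_count_)  # number of boosting iterations actually built
importances = model.get_feature_importance(train_pool)
for i, imp in enumerate(importances):
    print(f'Feature {i + 1} importance: {imp:.2f}')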
Step 6: Evaluating the Model
After training, we should evaluate our model’s performance to see how well it learned from the dataset.
# Make predictions
predictions = model.predict(X)
# Calculate accuracy
accuracy = np.sum(predictions.flatten() == y) / len(y)
print(f'Accuracy: {accuracy:.2f}')
Output:
Accuracy: 0.99
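Note that this accuracy is measured on the same data the model was trained on; since the labels here are random, a high score mainly reflects memorization rather than generalization. A more honest check would hold out part of the data. A minimal sketch using scikit-learn's train_test_split (an extra dependency not used in the original steps):
from sklearn.model_selection import train_test_split
# Split the synthetic data into a training and a test portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Train a fresh model on the training portion only
holdout_model = CatBoostClassifier(iterations=100, depth=2, learning_rate=1, loss_function='Logloss')
holdout_model.fit(Pool(data=X_train, label=y_train), verbose=False)
# Evaluate on the held-out portion
test_preds = holdout_model.predict(X_test)
test_accuracy = np.mean(test_preds.flatten() == y_test)
print(f'Held-out accuracy: {test_accuracy:.2f}')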
Step 7: Visualizing the Model’s Decision Boundary
Finally, let’s visualize the decision boundary created by our model.
# Create a grid of points
xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title('Model Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
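Instead of hard class predictions, we can also plot the predicted probability of class 1 over the same grid, which gives a smoother picture of the boundary (a variation on the step above, using predict_proba):
# Plot the predicted probability of class 1 over the same grid
proba = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
proba = proba.reshape(xx.shape)
plt.contourf(xx, yy, proba, alpha=0.4, cmap='viridis')
plt.colorbar(label='P(class = 1)')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title('Predicted Probability Surface')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()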
CatBoost Embedding Features
The capacity to convert raw data into a format that machine learning models can use is essential, and CatBoost, a robust gradient boosting library, has become increasingly popular because of how easily it handles categorical information. One of its notable capabilities is CatBoost Embeddings, a mechanism that can improve a model's predictive power, particularly when working with categorical data, by turning categories into informative numerical representations. In the rest of this article, we will look at the idea of CatBoost Embeddings: why it matters, how it works, and how it affects model performance.
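Beyond its built-in categorical encoding, recent CatBoost releases also accept precomputed embedding vectors through an embedding_features argument on Pool; whether this is available depends on the installed version, so treat the following as a sketch rather than guaranteed API. The embedding vectors and labels below are made up purely for illustration:
import pandas as pd
# Minimal sketch, assuming a CatBoost version that supports embedding features in Pool.
# One 8-dimensional embedding vector per sample (e.g. from a text or entity encoder).
embedding_vectors = [np.random.rand(8) for _ in range(100)]
embedding_labels = np.random.randint(0, 2, 100)
emb_pool = Pool(
    data=pd.DataFrame({'embedding': embedding_vectors}),
    label=embedding_labels,
    embedding_features=['embedding']
)
emb_model = CatBoostClassifier(iterations=50, depth=2, loss_function='Logloss')
emb_model.fit(emb_pool, verbose=False)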