Visualizing K-Fold Cross-Validation Behavior

We can create a classification dataset and visualize the behavior of K-Fold cross-validation on it. The code is as follows:

Python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# define a synthetic binary classification dataset
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=15, n_redundant=5)

# prepare the K-Fold cross-validation procedure
n_splits = 10
cv = KFold(n_splits=n_splits)


Using the make_classification() function, we created a synthetic binary classification dataset of 100 samples with 20 features and prepared a K-Fold cross-validation procedure for it with 10 folds. Calling cv.split() yields the training and test indices for each fold, so we can inspect exactly how the data is divided between the training and test sets, as the short sketch below shows.
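
For example, we can loop over the splits and print which indices land in each set. This is a minimal sketch of our own (the loop and print format are not part of the original listing) that reuses the cv and X objects defined above:

Python
# inspect how KFold assigns samples to the training and test sets;
# without shuffling, each test fold is a contiguous block of indices
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    print(f"Fold {fold}: train size={len(train_idx)}, "
          f"test indices {test_idx.min()}..{test_idx.max()}")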

Let’s visualize K-Fold cross-validation behavior in Sklearn. The code is as follows:

Python
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np

def plot_kfold(cv, X, y, ax, n_splits, xlim_max=100):
    """
    Plots the indices for a cross-validation object.

    Parameters:
    cv: Cross-validation object
    X: Feature set
    y: Target variable
    ax: Matplotlib axis object
    n_splits: Number of folds in the cross-validation
    xlim_max: Maximum limit for the x-axis
    """
    
    # Set color map for the plot
    cmap_cv = plt.cm.coolwarm
    cv_split = cv.split(X=X, y=y)
    
    for i_split, (train_idx, test_idx) in enumerate(cv_split):
        # Create an array of NaNs and fill in training/testing indices
        indices = np.full(len(X), np.nan)
        indices[test_idx], indices[train_idx] = 1, 0
        
        # Plot the training and testing indices
        ax_x = range(len(indices))
        ax_y = [i_split + 0.5] * len(indices)
        ax.scatter(ax_x, ax_y, c=indices, marker="_", 
                   lw=10, cmap=cmap_cv, vmin=-0.2, vmax=1.2)

    # Set y-ticks and labels
    y_ticks = np.arange(n_splits) + 0.5
    ax.set(yticks=y_ticks, yticklabels=range(n_splits),
           xlabel="X index", ylabel="Fold",
           ylim=[n_splits, -0.2], xlim=[0, xlim_max])

    # Set plot title and create legend
    ax.set_title("KFold", fontsize=14)
    legend_patches = [Patch(color=cmap_cv(0.8), label="Testing set"),
                      Patch(color=cmap_cv(0.02), label="Training set")]
    ax.legend(handles=legend_patches, loc=(1.03, 0.8))

# Create figure and axis
fig, ax = plt.subplots(figsize=(6, 3))
plot_kfold(cv, X, y, ax, n_splits)
plt.tight_layout()
fig.subplots_adjust(right=0.6)
plt.show()

Output

[Figure: "KFold" plot showing, for each of the 10 folds, which sample indices fall into the training set and which into the test set]

In the above code, we used Matplotlib to visualize the indices of a K-Fold cross-validation object. For each CV split, we filled a NumPy array marking every sample as belonging to the training or the test group and plotted it as a horizontal band with the scatter() method. The cmap parameter maps those group labels to the training and test colors, the lw parameter sets the thickness of each fold's band, and the vmin/vmax arguments keep the two group colors away from the extremes of the colormap. Finally, we formatted the X and Y axes using the set() method.
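
The same helper can be reused to see how shuffling changes the picture. As a small sketch of our own (shuffle and random_state are standard KFold parameters, but this comparison is not part of the original code), we can plot a shuffled K-Fold, where each fold's test samples are scattered across the dataset instead of forming contiguous blocks:

Python
# visualize KFold with shuffling: test indices are no longer contiguous
cv_shuffled = KFold(n_splits=n_splits, shuffle=True, random_state=42)

fig, ax = plt.subplots(figsize=(6, 3))
plot_kfold(cv_shuffled, X, y, ax, n_splits)
ax.set_title("KFold (shuffle=True)", fontsize=14)  # override the default title
plt.tight_layout()
fig.subplots_adjust(right=0.6)
plt.show()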
