Implementing Stratified Sampling
Let us load the iris dataset to implement stratified sampling.
- iris = datasets.load_iris(): Loads the famous Iris dataset from scikit-learn. This dataset contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers, representing three different species.
- The value_counts() method provides a quick overview of the distribution of these classes in the dataset.
Python3
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
iris_df['class'].value_counts()
Let us use the train_test_split function from scikit-learn’s model_selection module to split the dataset into training and testing sets.
- X and y: The data to be split into a training set and a test set.
- train_size: Represents the proportion of the dataset to include in the training split. In this case, train_size=0.8 means 80% of the data will be used for training, and the remaining 20% will be used for testing.
- random_state: If an integer value is provided, it ensures reproducibility by fixing the random seed, so the random split will be the same every time you run the code with the same random state. If set to None, a different random split is produced on each run.
- shuffle: If set to True (which is the default), the data is shuffled before splitting. If set to False, the data is split without shuffling; in that case stratify must be None.
- stratify: The array-like on which to stratify. Here we set it to the target variable (y). When stratify=y, the class distribution in the training and test sets matches that of the original dataset.
Let us see class distribution when stratify is set to None.
Python3
from sklearn.model_selection import train_test_split

# Features and target taken from the DataFrame built above
X = iris_df.drop('class', axis=1)
y = iris_df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=None, shuffle=True, stratify=None)

print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())
Output:
Class distribution of train set
0 43
2 40
1 37
Name: class, dtype: int64
Class distribution of test set
1 13
2 10
0 7
Name: class, dtype: int64
Let us see the class distribution when stratify is set to y.
Python3
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=None, shuffle=True, stratify=y)

print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())
Output:
Class distribution of train set
0 40
2 40
1 40
Name: class, dtype: int64
Class distribution of test set
2 10
1 10
0 10
Name: class, dtype: int64
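As a quick sanity check (a sketch, not part of the original snippets), we can rebuild the data and assert that stratified splitting preserves the 80/20 split within every class — 40 training and 10 test samples per species:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='class')

# random_state is fixed here only so the check is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

# Each of the 3 classes contributes exactly 40 train and 10 test samples
print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```

Because each class has 50 samples, every per-class count lands exactly on the 80/20 boundary; with unbalanced classes the counts are as close to proportional as integer rounding allows.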
If we want to draw repeated stratified random splits, we can use the StratifiedShuffleSplit class from Scikit-Learn as below. (Unlike k-fold cross-validation, the test sets of different splits may overlap.)
- StratifiedShuffleSplit is a class in scikit-learn that provides a method for generating train/test indices for cross-validation. It is specifically designed for scenarios where you want to ensure that the class distribution in the dataset is maintained when splitting the data into training and testing sets.
- n_splits: The number of re-shuffling and splitting iterations. In the example, n_splits=2 means the dataset will be split into 2 different train/test sets.
- test_size: The proportion of the dataset to include in the test split. It can be a float (for example, 0.2 for 20%) or an integer (the absolute number of test samples). The example below passes train_size=0.8 instead; test_size then defaults to the complement, 20%.
- random_state: Seed for the random number generator to ensure reproducibility. If set to an integer, the same random splits will be generated each time.
Python3
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=2, train_size=0.8)
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

for i, (train_index, test_index) in enumerate(sss.split(X, y)):
    print(f"Fold {i}:")
    print(iris_df.iloc[train_index]['class'].value_counts())
    print("-" * 10)
    print(iris_df.iloc[test_index]['class'].value_counts())
    print("*" * 60)
Output:
Fold 0:
2 40
1 40
0 40
Name: class, dtype: int64
----------
2 10
1 10
0 10
Name: class, dtype: int64
************************************************************
Fold 1:
2 40
1 40
0 40
Name: class, dtype: int64
----------
2 10
0 10
1 10
Name: class, dtype: int64
************************************************************
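For stratified k-fold cross-validation proper — where every sample appears in exactly one test fold — scikit-learn provides the StratifiedKFold class. A minimal sketch on the same Iris data (the n_splits=5 choice is illustrative, not from the article):

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.Series(iris.target, name='class')

# 5 folds: each test fold holds 30 samples, 10 per class
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(y.iloc[test_index].value_counts().sort_index())
```

Here the five test folds partition the dataset, so each is used for evaluation exactly once, while StratifiedShuffleSplit draws each split independently.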