Implementing Stratified Sampling
Let us load the iris dataset to implement stratified sampling.
- iris = datasets.load_iris(): Loads the famous Iris dataset from scikit-learn. This dataset contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers, representing three different species.
- The value_counts() method provides a quick overview of the distribution of these classes in the dataset.
Python3
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
iris_df['class'].value_counts()
Let us use the train_test_split function from scikit-learn’s model_selection module to split the dataset into training and testing sets.
- X and y: The data to be split into a training set and a test set.
- train_size: Represents the proportion of the dataset to include in the training split. In this case, train_size=0.8 means 80% of the data will be used for training, and the remaining 20% will be used for testing.
- random_state: If an integer value is provided, it ensures reproducibility by fixing the random seed, so the random split will be the same every time you run the code with the same random state. If set to None, a different random split is produced on each run.
- shuffle: If set to True (which is the default), the data is shuffled before splitting. If set to False, the data is split without shuffling; in that case stratify must be None.
- stratify: The array-like on which to stratify. Here we set it to the target variable (y). When stratify=y, the class distribution in the training and test sets matches that of the original dataset.
Let us see class distribution when stratify is set to None.
Python3
from sklearn.model_selection import train_test_split

# Features and target taken from the DataFrame built above
X = iris_df.drop('class', axis=1)
y = iris_df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=None, shuffle=True, stratify=None)

print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())
Output:
Class distribution of train set
0 43
2 40
1 37
Name: class, dtype: int64
Class distribution of test set
1 13
2 10
0 7
Name: class, dtype: int64
Let us see the class distribution when stratify is set to y.
Python3
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=None, shuffle=True, stratify=y)

print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())
Output:
Class distribution of train set
0 40
2 40
1 40
Name: class, dtype: int64
Class distribution of test set
2 10
1 10
0 10
Name: class, dtype: int64
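As a quick sanity check (a sketch, not part of the original snippets), we can rebuild the data and assert that stratified splitting preserves the 80/20 split within every class — 40 training and 10 test samples per species:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='class')

# random_state is fixed here only so the check is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

# Each of the 3 classes contributes exactly 40 train and 10 test samples
print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```

Because each class has 50 samples, every per-class count lands exactly on the 80/20 boundary; with unbalanced classes the counts are as close to proportional as integer rounding allows.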
If we want to draw repeated stratified random splits, we can use the StratifiedShuffleSplit class from Scikit-Learn as below. (Unlike k-fold cross-validation, the test sets of different splits may overlap.)
- StratifiedShuffleSplit is a class in scikit-learn that provides a method for generating train/test indices for cross-validation. It is specifically designed for scenarios where you want to ensure that the class distribution in the dataset is maintained when splitting the data into training and testing sets.
- n_splits: The number of re-shuffling and splitting iterations. In the example, n_splits=2 means the dataset will be split into 2 different train/test sets.
- test_size: The proportion of the dataset to include in the test split. It can be a float (for example, 0.2 for 20%) or an integer (the absolute number of test samples). The example below passes train_size=0.8 instead; test_size then defaults to the complement, 20%.
- random_state: Seed for the random number generator to ensure reproducibility. If set to an integer, the same random splits will be generated each time.
Python3
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=2, train_size=0.8)
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

for i, (train_index, test_index) in enumerate(sss.split(X, y)):
    print(f"Fold {i}:")
    print(iris_df.iloc[train_index]['class'].value_counts())
    print("-" * 10)
    print(iris_df.iloc[test_index]['class'].value_counts())
    print("*" * 60)
Output:
Fold 0:
2 40
1 40
0 40
Name: class, dtype: int64
----------
2 10
1 10
0 10
Name: class, dtype: int64
************************************************************
Fold 1:
2 40
1 40
0 40
Name: class, dtype: int64
----------
2 10
0 10
1 10
Name: class, dtype: int64
************************************************************
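For stratified k-fold cross-validation proper — where every sample appears in exactly one test fold — scikit-learn provides the StratifiedKFold class. A minimal sketch on the same Iris data (the n_splits=5 choice is illustrative, not from the article):

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.Series(iris.target, name='class')

# 5 folds: each test fold holds 30 samples, 10 per class
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(y.iloc[test_index].value_counts().sort_index())
```

Here the five test folds partition the dataset, so each is used for evaluation exactly once, while StratifiedShuffleSplit draws each split independently.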