Implementing Stratified Sampling

Let us load the iris dataset to implement stratified sampling.

  • iris = datasets.load_iris(): Loads the famous Iris dataset from scikit-learn. This dataset contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers, representing three different species.
  • The value_counts() method provides a quick overview of the distribution of these classes in the dataset.

Python3

import pandas as pd
from sklearn import datasets

# Load the iris dataset into a DataFrame and attach the target column
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']

# Inspect how the three classes are distributed
iris_df['class'].value_counts()


Let us use the train_test_split function from scikit-learn’s model_selection module to split the dataset into training and testing sets.

  • X and y: The data to be split into a training set and a test set.
  • train_size: Represents the proportion of the dataset to include in the training split. In this case, train_size=0.8 means 80% of the data will be used for training, and the remaining 20% will be used for testing.
  • random_state: If an integer value is provided, it ensures reproducibility by fixing the random seed, so the random split is the same every time you run the code with the same random state. If set to None, a different random split will be used each time.
  • shuffle: If set to True (which is the default), the data is shuffled before splitting. If set to False, the data is split without shuffling, and stratify must be None.
  • stratify: The array on which to stratify. Here we set it to the target variable (y). When stratify=y, the class distribution in the training and test sets mirrors that of the original dataset.

Let us see class distribution when stratify is set to None.

Python3

from sklearn.model_selection import train_test_split

# Separate the features and the target column
X = iris_df.drop('class', axis=1)
y = iris_df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,
                                                    random_state=None,
                                                    shuffle=True, stratify=None)

print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())


Output:

Class distribution of train set
0    43
2    40
1    37
Name: class, dtype: int64
Class distribution of test set
1    13
2    10
0     7
Name: class, dtype: int64

Let us see the class distribution when stratify is set to y (the target variable).

Python3

from sklearn.model_selection import train_test_split

# Separate the features and the target column
X = iris_df.drop('class', axis=1)
y = iris_df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,
                                                    random_state=None,
                                                    shuffle=True, stratify=y)

print("Class distribution of train set")
print(y_train.value_counts())
print("Class distribution of test set")
print(y_test.value_counts())


Output:

Class distribution of train set
0    40
2    40
1    40
Name: class, dtype: int64
Class distribution of test set
2    10
1    10
0    10
Name: class, dtype: int64

If we want to draw several stratified train/test splits (for example, for repeated evaluation), we can use the StratifiedShuffleSplit class from scikit-learn as below. Note that, unlike k-fold splitting, the randomized test sets it produces may overlap across iterations.

  • StratifiedShuffleSplit is a class in scikit-learn that provides a method for generating train/test indices for cross-validation. It is specifically designed for scenarios where you want to ensure that the class distribution in the dataset is maintained when splitting the data into training and testing sets.
  • n_splits: The number of re-shuffling and splitting iterations. In the example, n_splits=2 means the dataset will be split into 2 different train/test sets.
  • test_size / train_size: The proportion of the dataset to include in the test (or training) split. It can be a float (for example, 0.2 for 20%) or an integer number of samples. The example below specifies train_size=0.8, which is equivalent to test_size=0.2.
  • random_state: Seed for the random number generator to ensure reproducibility. If set to an integer, the same random splits will be generated each time.

Python3

from sklearn.model_selection import StratifiedShuffleSplit

# Two independent stratified splits, 80% train / 20% test each
sss = StratifiedShuffleSplit(n_splits=2, train_size=0.8)
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

for i, (train_index, test_index) in enumerate(sss.split(X, y)):
    print(f"Fold {i}:")
    print(f" {iris_df.iloc[train_index]['class'].value_counts()}")
    print("-" * 10)
    print(f" {iris_df.iloc[test_index]['class'].value_counts()}")
    print("*" * 60)


Output:

Fold 0:
 2    40
1    40
0    40
Name: class, dtype: int64
----------
 2    10
1    10
0    10
Name: class, dtype: int64
************************************************************
Fold 1:
 2    40
1    40
0    40
Name: class, dtype: int64
----------
 2    10
0    10
1    10
Name: class, dtype: int64
************************************************************
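If you want true k-fold behaviour, where the k test folds are disjoint and together cover the whole dataset while still preserving class proportions, scikit-learn also provides StratifiedKFold. A minimal sketch on the same iris data (reloaded here so the snippet runs standalone):

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.Series(iris.target, name='class')

# Five disjoint folds; each test fold holds 150 / 5 = 30 samples,
# and stratification keeps exactly 10 samples per class in each.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}: {y.iloc[test_index].value_counts().to_dict()}")
```

Unlike StratifiedShuffleSplit, every sample appears in exactly one test fold, which is what standard cross-validation expects.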

