Class Imbalance Handling in Machine Learning
Resampling, which modifies the sample distribution, is a frequently used technique for handling highly imbalanced datasets. It can be done either by over-sampling, which adds more examples from the minority class, or by under-sampling, which removes samples from the majority class.
Balancing classes through over- and under-sampling has advantages, but each approach also has drawbacks.
The simplest form of over-sampling replicates random records from the minority class, which may cause overfitting.
Conversely, the simplest form of under-sampling removes random records from the majority class, which may cause information loss.
In up-sampling, samples from the minority class are randomly duplicated until it matches the majority class in size. There are several ways to do this.
1. Using resample from scikit-learn
The resample utility in sklearn.utils can be used for either direction of random resampling. Under-sampling removes observations from the majority class until the majority and minority classes are balanced.
Under-sampling is useful when working with large datasets, especially ones with millions of rows, but there is a risk that important information will be lost during the removal process. The example below goes the other way and up-samples the minority class.
Example:
# Importing scikit-learn, pandas library
from sklearn.utils import resample
from sklearn.datasets import make_classification
import pandas as pd
# Making DataFrame having 100
# dummy samples with 4 features
# Divided in 2 classes in a ratio of 80:20
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)
df = pd.DataFrame(X, columns=['feature_1',
                              'feature_2',
                              'feature_3',
                              'feature_4'])
df['balance'] = y
print(df)
# Let df represent the dataset
# Dividing majority and minority classes
df_major = df[df.balance == 0]
df_minor = df[df.balance == 1]
# Upsampling minority class
df_minor_sample = resample(df_minor,
                           # Upsample with replacement
                           replace=True,
                           # Number to match majority class
                           n_samples=len(df_major),
                           random_state=42)
# Combine majority and upsampled minority class
df_sample = pd.concat([df_major, df_minor_sample])
# Display count of data points in both classes
print(df_sample.balance.value_counts())
Output:
feature_1 feature_2 feature_3 feature_4 balance
0 -1.053839 -1.027544 -0.329294 0.826007 1
1 1.569317 1.306542 -0.239385 -0.331376 0
2 -0.658926 -0.357633 0.723682 -0.628277 0
3 -0.136856 0.460938 1.896911 -2.281386 0
4 -0.048629 0.502301 1.778730 -2.171053 0
.. ... ... ... ... ...
95 -2.241820 -1.248690 2.357902 -2.009185 0
96 0.573042 0.362054 -0.462814 0.341294 1
97 -0.375121 -0.149518 0.588465 -0.575002 0
98 1.042518 1.058239 0.461945 -0.984846 0
99 -0.121203 -0.043997 0.204211 -0.203119 0
[100 rows x 5 columns]
0 80
1 80
Name: balance, dtype: int64
Explanation:
- First, the data points of each class are split into separate DataFrames.
- Next, the minority class is resampled with replacement, with the number of samples set equal to the size of the majority class.
- Finally, the original majority class DataFrame and the up-sampled minority class DataFrame are concatenated.
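For comparison, random under-sampling of the majority class can be sketched with the same resample utility. This is only a minimal sketch, assuming the same 80:20 dummy dataset as in the example above. The names df_major_down and df_under are illustrative and do not appear in the original example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Same dummy dataset as in the up-sampling example
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)
df = pd.DataFrame(X, columns=['feature_1', 'feature_2',
                              'feature_3', 'feature_4'])
df['balance'] = y

# Dividing majority and minority classes
df_major = df[df.balance == 0]
df_minor = df[df.balance == 1]

# Under-sample the majority class WITHOUT replacement,
# down to the size of the minority class
# (df_major_down is an illustrative name)
df_major_down = resample(df_major,
                         replace=False,
                         n_samples=len(df_minor),
                         random_state=42)

# Combine under-sampled majority class and minority class
df_under = pd.concat([df_major_down, df_minor])
print(df_under.balance.value_counts())
```

Here the minority rows are kept unchanged, while the majority rows that were not drawn are discarded. Those discarded rows are the information loss that under-sampling risks.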
2. Using RandomOverSampler:
Oversampling is the process of adding more copies of minority-class samples to the dataset. This approach is helpful when data is limited and removing samples is not affordable. Potential drawbacks of oversampling are overfitting and decreased generalization performance on the test set.
This can be done with the RandomOverSampler class from imblearn. It randomly picks existing minority-class samples with replacement and adds the copies to the dataset (by default, with no shrinkage applied).
Syntax: RandomOverSampler(sampling_strategy='auto', random_state=None, shrinkage=None)
Parameters:
- sampling_strategy: Specifies which classes are resampled. Accepted string values are 'minority' (resample only the minority class), 'not minority' (all classes except the minority class), 'not majority' (all classes except the majority class), 'all' (all classes) and 'auto' (equivalent to 'not majority'). Default value is 'auto'.
- random_state: Controls the randomness of the sampling. Pass an int for reproducible results, a RandomState instance, or None to use NumPy's global random state. Default value is None.
- shrinkage: Controls the shrinkage applied when a smoothed bootstrap is generated. float: shrinkage factor applied to all classes. dict: a specific shrinkage factor for each class. None: no perturbation is applied, which is equivalent to shrinkage=0. Default value is None.
Implementation of RandomOverSampler
# Importing imblearn,scikit-learn library
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
# Making Dataset having 100
# dummy samples with 4 features
# Divided in 2 classes in a ratio of 80:20
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)
# Printing number of samples in
# each class before Over-Sampling
print('Before Over-Sampling: ')
print('Samples in class 0: ', (y == 0).sum())
print('Samples in class 1: ', (y == 1).sum())
# Over Sampling Minority class
OverS = RandomOverSampler(random_state=42)
# Fit predictor (x variable)
# and target (y variable) using fit_resample()
X_Over, Y_Over = OverS.fit_resample(X, y)
# Printing number of samples in
# each class after Over-Sampling
print('After Over-Sampling: ')
print('Samples in class 0: ', (Y_Over == 0).sum())
print('Samples in class 1: ', (Y_Over == 1).sum())
Output:
Before Over-Sampling:
Samples in class 0: 80
Samples in class 1: 20
After Over-Sampling:
Samples in class 0: 80
Samples in class 1: 80
- This code illustrates how to use imbalanced-learn’s RandomOverSampler to address class imbalance in a dataset.
- By randomly duplicating existing samples of the minority class, it balances the class distribution.
- For comparison, the number of samples in each class is printed both before and after oversampling.
How to Handle Imbalanced Classes in Machine Learning
In machine learning, imbalanced classes are a common problem, particularly in classification, where a dataset contains an unequal ratio of data points across classes.
Training a model becomes trickier because plain accuracy is no longer a reliable measure of its performance. If the minority class has far fewer data points than the majority class, the model may end up ignoring it completely during training.
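To make the point about accuracy concrete, here is a minimal sketch, assuming an 80:20 dummy dataset and scikit-learn's DummyClassifier as a stand-in for a model that ignores the minority class entirely. The dataset and variable names are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Dummy dataset: 100 samples, 2 classes in a ratio of 80:20
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

# Baseline that always predicts the most frequent class
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X, y)
y_pred = clf.predict(X)

# Overall accuracy versus recall on the minority class (label 1)
acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, pos_label=1)

print('Accuracy:', acc)
print('Minority recall:', rec)
```

The accuracy equals the share of the majority class even though not a single minority sample is identified, which is why metrics such as recall are worth checking on imbalanced data.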