I have a pandas DataFrame that contains training examples, for example:
   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0
which I generated using:
import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
}, columns=['feature1', 'feature2', 'class'])
print(frame)
As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., so that the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do this?
Ideally I would like a solution that may generalize to a multiclass setting (i.e., the integer in the class column may be more than 1).
Resampling: a widely adopted technique for dealing with highly imbalanced datasets is resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples of the minority class (over-sampling).
Random undersampling discards randomly chosen rows from the majority class until every class has the same number of samples as the minority class. It is simple, but it throws training data away; for a set as small as yours, over-sampling (duplicating minority-class rows) keeps all the original information and is the better fit.
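For contrast, here is a minimal sketch of random undersampling on the frame above (undersampled is a hypothetical name; the approach works for any number of classes):

# Size of the smallest class (2 in this example).
min_size = frame['class'].value_counts().min()
# Keep min_size randomly chosen rows (without replacement) from each class.
undersampled = pd.concat(
    [group.sample(min_size) for _, group in frame.groupby('class')]
)
print(undersampled['class'].value_counts())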
You can find the size of the largest group with
max_size = frame['class'].value_counts().max()
In your example, this equals 8. For each group, you can then sample max_size - len(group) elements with replacement. If you concatenate these samples with the original DataFrame, all classes will have the same size and you'll keep every original row.
lst = [frame]  # start from the original rows
for class_index, group in frame.groupby('class'):
    # Draw (with replacement) enough extra rows to bring this class up to max_size.
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)
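You can verify the result by checking the class counts of the new frame; because the loop tops every class up to max_size, this works unchanged in a multiclass setting:

print(frame_new['class'].value_counts())
# Both classes now have max_size (= 8) rows:
# 0    8
# 1    8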
As written, this makes all group sizes exactly equal. If you prefer the classes to be only approximately balanced, you can play with max_size - len(group), e.g. by adding some random noise to the target size, as in the sketch below.
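A minimal sketch of that idea, assuming a jitter of up to ±2 rows per class (the noise range and the frame_noisy name are arbitrary choices for illustration):

lst = [frame]
for class_index, group in frame.groupby('class'):
    # Jitter the number of duplicated rows; clamp at 0 so .sample() never gets a negative n.
    n_extra = max(0, max_size - len(group) + np.random.randint(-2, 3))
    lst.append(group.sample(n_extra, replace=True))
frame_noisy = pd.concat(lst)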