
Duplicating training examples to handle class imbalance in a pandas data frame

I have a pandas DataFrame that contains training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0

which I generated using:

import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class':    np.random.binomial(2, 0.1, size=number_of_samples), 
    },columns=['feature1','feature2','class'])

print(frame)

As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., so that the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do so?

Ideally, I would like a solution that generalizes to a multiclass setting (i.e., the integer in the class column may be greater than 1).

Franck Dernoncourt asked Jan 22 '18 00:01



1 Answer

You can find the size of the largest group with:

max_size = frame['class'].value_counts().max()

In your example, this equals 8. For each group, you can sample max_size - len(group) elements with replacement. If you then concat these samples to the original DataFrame, every group ends up with the same size and you keep all the original rows.

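lst = [frame]  # start from the original rows so none are lost
for class_index, group in frame.groupby('class'):
    # draw the rows this class is missing, with replacement
    lst.append(group.sample(max_size-len(group), replace=True))
frame_new = pd.concat(lst)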
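
To check that the loop above also covers the multiclass case, here is a minimal self-contained sketch that wraps it in a function and runs it on a small three-class frame. The helper name oversample_to_max, the random_state argument, the ignore_index=True choice and the demo data are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=None):
    """Hypothetical helper: duplicate rows (sampling with replacement)
    until every class has as many rows as the largest class."""
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:  # the largest class needs no extra rows
            parts.append(
                group.sample(n_extra, replace=True, random_state=random_state)
            )
    # ignore_index=True renumbers the rows; without it the duplicated
    # rows keep the index labels of the rows they were copied from.
    return pd.concat(parts, ignore_index=True)


# Illustrative data: 5 rows of class 0, 2 rows of class 1, 1 row of class 2.
demo = pd.DataFrame({
    "feature1": np.arange(8, dtype=float),
    "class": [0, 0, 0, 0, 0, 1, 1, 2],
})

balanced = oversample_to_max(demo, random_state=0)
print(balanced["class"].value_counts().sort_index())
```

After the call, each of the three classes has 5 rows, and the first 8 rows of the result are the original ones. Passing random_state only makes the duplicated rows reproducible; it does not change the resulting class counts.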

Because this makes all group sizes exactly equal, you can adjust max_size-len(group), for example by adding some noise to it, if you only want the classes to be approximately balanced.
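
One possible reading of the noise suggestion is to subtract a small random integer from the number of extra rows drawn for each minority class, so the classes end up close to balanced rather than exactly balanced. The sketch below assumes that reading; the function name oversample_approx, the max_jitter and seed parameters, and the demo data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def oversample_approx(frame, label_col="class", max_jitter=2, seed=0):
    """Hypothetical helper: oversample each minority class to within
    max_jitter rows of the largest class."""
    rng = np.random.default_rng(seed)
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # Subtract a random integer in [0, max_jitter], never going below 0.
            jitter = int(rng.integers(0, max_jitter + 1))
            n_extra = max(n_extra - jitter, 0)
            parts.append(group.sample(n_extra, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Illustrative data: 5 rows of class 0, 2 rows of class 1, 1 row of class 2.
demo_frame = pd.DataFrame({
    "feature1": np.arange(8, dtype=float),
    "class": [0, 0, 0, 0, 0, 1, 1, 2],
})

approx = oversample_approx(demo_frame, max_jitter=2, seed=0)
approx_counts = approx["class"].value_counts().sort_index()
print(approx_counts)
```

With max_jitter=2, every minority class in this demo ends up with between 3 and 5 rows, while the largest class is left untouched at 5.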

ayhan answered Oct 13 '22 20:10