Duplicating training examples to handle class imbalance in a pandas data frame

I have a pandas DataFrame that contains training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0

which I generated using:

import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
}, columns=['feature1', 'feature2', 'class'])

print(frame)

As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do so?

Ideally I would like a solution that generalizes to a multiclass setting (i.e., the integer in the class column may be greater than 1).

asked Jan 22 '18 00:01

Franck Dernoncourt

1 Answer

You can find the maximum size a group has with

max_size = frame['class'].value_counts().max()

In your example, this equals 8. For each group, you can sample max_size - len(group) rows with replacement. If you then concat these samples to the original DataFrame, every class ends up with max_size rows and all of the original rows are kept.

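lst = [frame]  # start from the original rows so none are lost
for class_index, group in frame.groupby('class'):
    # draw the rows this class is missing, with replacement (0 rows for the largest class)
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)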

Because this makes all group sizes exactly equal, you can adjust max_size - len(group), for example by adding some noise to it, if you only want the classes to be approximately balanced.
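The approach above can be wrapped in a small helper. This is only a sketch: the function name oversample_to_max and its label_col and random_state parameters are made up for illustration, and the frame is rebuilt from the table in the question so that the example does not depend on the random seed. The final shuffle and index reset are optional additions, not part of the original answer.

```python
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=None):
    """Duplicate rows of the smaller classes until every class has as many
    rows as the largest one (hypothetical helper, not a pandas API)."""
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # sample the missing rows with replacement from this class only
            parts.append(group.sample(n_extra, replace=True, random_state=random_state))
    balanced = pd.concat(parts)
    # optional: shuffle so duplicates are not clustered at the end, then renumber
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# The ten rows shown in the question
frame = pd.DataFrame({
    "feature1": [0.548814, 0.715189, 0.602763, 0.544883, 0.423655,
                 0.645894, 0.437587, 0.891773, 0.963663, 0.383442],
    "feature2": [0.791725, 0.528895, 0.568045, 0.925597, 0.071036,
                 0.087129, 0.020218, 0.832620, 0.778157, 0.870012],
    "class": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

balanced = oversample_to_max(frame, random_state=0)
print(balanced["class"].value_counts())
```

Since the loop runs over whatever groups groupby finds, the same helper should also work when the class column holds more than two distinct values. If you oversample before splitting into training and validation sets, copies of the same row can land on both sides, so it is usually safer to split first and oversample only the training part.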

answered Oct 13 '22 20:10

ayhan

Tags: python, pandas, machine-learning, oversampling
