How to balance classes in a numpy array?

Question

I have 2 numpy arrays as follows:

images contains the names of image files (images.shape is (N, 3, 128, 128)): image_1.jpg image_2.jpg image_3.jpg image_4.jpg

labels contains the corresponding labels (0-3) (labels.shape is (N,)): 1 1 3 2

The issue I'm facing is that the classes are imbalanced, with class 3 >> 1 > 2 > 0.

I'd like to balance the final dataset by:

counting the number of images (samples) in each class
getting the count of the class with the lowest number of images
using that count as the maximum number of images / labels for the other 3 classes
randomly popping excess images / labels from the other 3 classes in images and labels

So far I'm using Counter to identify the number of images per class:

from collections import Counter
import numpy as np

count = Counter(labels)
print(count)

>>>Counter({'1': 2991, '0': 2953, '2': 2510, '3': 2488})
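As a side note, the size of the smallest class can also be read off with np.unique and return_counts=True rather than being copied by hand from the Counter output. This is only a minimal sketch: the labels array below is made-up toy data (string labels, to match the Counter keys above), and n_min is a name chosen for illustration.

```python
import numpy as np

# Hypothetical stand-in for the real labels array (string labels, as in the Counter output)
labels = np.array(['1', '1', '3', '2', '0', '1', '3', '0', '2', '1'])

# Sorted unique class values and how often each one occurs
classes, counts = np.unique(labels, return_counts=True)

# Size of the rarest class, i.e. the per-class target after balancing
n_min = counts.min()

print(dict(zip(classes.tolist(), counts.tolist())), n_min)
```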

How would you suggest I randomly pop matching elements from images and labels so they contain 2488 samples of classes 0, 1, and 2?

maxymoo · Accepted Answer

You could use np.random.choice to build an array of integer indices (a "mask") that you can then apply to both labels and images to balance the dataset:

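# Target size per class: the count of the smallest class
n = 2488

# For each class, draw n indices without replacement, then concatenate them
mask = np.hstack([np.random.choice(np.where(labels == l)[0], n, replace=False)
                  for l in np.unique(labels)])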
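To actually get the balanced arrays, the index array is then used to select from both images and labels. The following is a rough end-to-end sketch under some assumptions: the toy labels and the tiny random images are invented for the demo, n is derived from the data instead of being hard-coded, and np.random.default_rng is swapped in for np.random.choice purely so the run is reproducible. The names balanced_images and balanced_labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical toy data standing in for the real arrays
labels = np.array([1, 1, 3, 2, 0, 1, 3, 0, 2, 1, 3, 3])
images = rng.random((labels.shape[0], 3, 4, 4))  # small images keep the demo light

# Use the size of the rarest class instead of a hard-coded n
classes, counts = np.unique(labels, return_counts=True)
n = counts.min()

# Draw n indices per class without replacement, as in the answer above
mask = np.hstack([rng.choice(np.where(labels == l)[0], n, replace=False)
                  for l in classes])
rng.shuffle(mask)  # optional: avoids the samples being grouped by class

# Index both arrays with the same indices so images and labels stay aligned
balanced_images = images[mask]
balanced_labels = labels[mask]
```

Because the same index array is applied to both arrays, each image keeps its matching label, and every class ends up with exactly n samples.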
