How to balance classes in a numpy array?

I have 2 numpy arrays as follows:

images contains the names of image files (images.shape is (N, 3, 128, 128)): image_1.jpg image_2.jpg image_3.jpg image_4.jpg

labels contains the corresponding labels (0-3) (labels.shape is (N,)): 1 1 3 2

The issue I'm facing is that the classes are imbalanced, with class 3 >> 1 > 2 > 0.

I'd like to balance the final dataset by:

  • count the number of images (samples) in each class
  • get the count of the class with the lowest number of images
  • use that count as the maximum number of images / labels for the other 3 classes
  • randomly pop excess images / labels from the other 3 classes in images and labels

So far I'm using Counter to identify the number of images per class:

from collections import Counter
import numpy as np

count = Counter(labels)
print(count)

>>>Counter({'1': 2991, '0': 2953, '2': 2510, '3': 2488})
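
For comparison, the same per-class counts can also be obtained with np.unique and return_counts=True, which gives the size of the smallest class directly. This is only a minimal sketch: the small labels array below is a made-up stand-in for the real data, and n_min is a name chosen for illustration.

```python
import numpy as np

# Hypothetical stand-in for the real labels array (string class ids,
# matching the keys shown in the Counter output)
labels = np.array(['1', '1', '3', '2', '0', '3', '3'])

# classes is sorted; counts[i] is the number of samples of classes[i]
classes, counts = np.unique(labels, return_counts=True)
n_min = counts.min()  # size of the smallest class

print(dict(zip(classes.tolist(), counts.tolist())), n_min)
```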

How would you suggest I randomly pop matching elements from images and labels so they contain 2488 samples of classes 0, 1, and 2?

pepe asked Feb 17 '26 11:02


1 Answer

You could use np.random.choice to build an array of integer indices (an index array rather than a boolean mask) and use it to index both labels and images, which balances the dataset:

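n = 2488  # size of the smallest class

# Draw n indices per class without replacement, then apply them to both arrays
mask = np.hstack([np.random.choice(np.where(labels == l)[0], n, replace=False)
                  for l in np.unique(labels)])
images, labels = images[mask], labels[mask]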
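
The snippet above hardcodes n. A possible generalisation, sketched here under a few assumptions, computes the smallest class size from the data and returns both arrays indexed consistently. The helper name balance_classes, its seed parameter, and the toy arrays are hypothetical and not part of the original answer; it uses NumPy's Generator API (np.random.default_rng) in place of np.random.choice.

```python
import numpy as np

def balance_classes(images, labels, seed=None):
    """Randomly undersample every class to the size of the smallest class.

    Hypothetical helper for illustration; images and labels must share
    their first dimension.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()  # target number of samples per class
    # For each class, pick n of its indices without replacement
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)  # so the result is not grouped by class
    # The same index array is applied to both, keeping pairs aligned
    return images[idx], labels[idx]

# Toy data: 8 samples shaped (N, 3, 2, 2) instead of (N, 3, 128, 128)
labels = np.array([1, 1, 3, 2, 3, 3, 0, 1])
images = np.arange(len(labels) * 3 * 2 * 2).reshape(len(labels), 3, 2, 2)

bal_images, bal_labels = balance_classes(images, labels, seed=0)
print(bal_images.shape, np.unique(bal_labels, return_counts=True))
```

Passing a fixed seed makes the subsample reproducible; leaving it as None draws a different subset on each call.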
maxymoo answered Feb 20 '26 01:02