 

Keras flow_from_directory: over- or undersample a class

I'm trying to do a binary classification problem with Keras, using the ImageDataGenerator.flow_from_directory method to generate batches. However, my classes are very imbalanced, with about 8x or 9x more examples in one class than the other, which causes the model to get stuck predicting the same output class for every example. Is there a way to set flow_from_directory to either oversample from my small class or undersample from my large class during each epoch? For now, I've just created multiple copies of each image in my smaller class, but I'd like to have a bit more flexibility.

George asked Jan 23 '17 20:01




1 Answer

With the current version of Keras, it's not possible to balance your dataset using only Keras built-in methods. flow_from_directory simply builds a list of all files and their classes, shuffles it (if needed), and then iterates over it.

But you could use a different trick: write your own generator that does the balancing in Python:

def balanced_flow_from_directory(flow_from_directory, options):
    # Wrap the Keras iterator and rebalance every batch before yielding it.
    for x, y in flow_from_directory:
        yield custom_balance(x, y, options)

Here custom_balance should be a function that, given a batch (x, y), balances it and returns a balanced batch (x', y'). For most applications the batch size doesn't need to stay the same, but there are some unusual use cases (e.g. stateful RNNs) where batches must have a fixed size.
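As a rough illustration of what custom_balance could look like, here is a minimal sketch that oversamples the rarer class inside each batch. It is only one possible approach, not something Keras provides. The helper name balanced_indices and the "rng" key in options are made up for this example, and it assumes class_mode="binary", so that y is a 1-D NumPy array of 0/1 labels.

```python
import random
from collections import defaultdict


def balanced_indices(labels, rng=random):
    """Return row indices that oversample rarer classes up to the largest one."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    if not by_class:
        return []
    target = max(len(rows) for rows in by_class.values())
    picked = []
    for rows in by_class.values():
        picked.extend(rows)
        # Draw the missing rows with replacement from the same class.
        picked.extend(rng.choice(rows) for _ in range(target - len(rows)))
    rng.shuffle(picked)
    return picked


def custom_balance(x, y, options=None):
    """Balance one batch; x and y are NumPy arrays as yielded by Keras."""
    # Hypothetical option: options["rng"] may hold a random.Random instance.
    rng = (options or {}).get("rng", random)
    # Assumes class_mode="binary", so y is a 1-D array of 0/1 labels.
    idx = balanced_indices([int(v) for v in y], rng)
    # Indexing a NumPy array with a list of ints selects (and repeats) rows.
    return x[idx], y[idx]
```

Note that the returned batch is usually larger than the input batch, and a batch that happens to contain only one class is passed through unchanged, since there is nothing to oversample from. The repeated rows are exact copies of images that were already augmented, so this is closer to the "multiple copies" workaround from the question than to drawing fresh augmented samples.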

Marcin Możejko answered Oct 16 '22 20:10
