I'm working on a binary classification problem with Keras, using the ImageDataGenerator.flow_from_directory method to generate batches. However, my classes are very imbalanced, with about 8x or 9x more examples in one class than the other, which causes the model to get stuck predicting the same output class for every example. Is there a way to set flow_from_directory to either oversample from my small class or undersample from my large class during each epoch? For now, I've just created multiple copies of each image in my smaller class, but I'd like to have a bit more flexibility.
Oversampling methods duplicate or create new synthetic examples in the minority class, whereas undersampling methods delete or merge examples in the majority class. Both types of resampling can be effective when used in isolation, although they can be more effective when used together.
What is the difference between these two techniques? Undersampling reduces the number of majority-class examples until it is similar to the number of minority-class examples. Oversampling, in contrast, resamples the minority class until its size approaches that of the majority class.
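To make the difference concrete, here is a minimal sketch of random oversampling and undersampling applied to a list of (file, label) pairs. The function name resample, its mode argument, and the file names are hypothetical and chosen only for illustration; this is plain Python, not part of Keras.

```python
import random
from collections import defaultdict

def resample(samples, mode="over", seed=0):
    """Balance (item, label) pairs by random over- or undersampling."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    sizes = [len(group) for group in by_label.values()]
    # Oversampling grows every class to the largest size,
    # undersampling shrinks every class to the smallest size.
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for group in by_label.values():
        if len(group) >= target:
            # Keep a random subset (the whole group if it is already at target).
            balanced.extend(rng.sample(group, target))
        else:
            # Keep every original, then add duplicates drawn with replacement.
            extra = rng.choices(group, k=target - len(group))
            balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical file list: nine "cat" images for every "dog" image.
samples = [(f"cat_{i}.jpg", 0) for i in range(9)] + [("dog_0.jpg", 1)]
over = resample(samples, mode="over")    # 9 cats + 9 dogs (dog_0.jpg repeated)
under = resample(samples, mode="under")  # 1 cat + 1 dog
```

Oversampling keeps all the data but repeats minority images, so it is usually paired with augmentation. Undersampling discards majority images, which may be acceptable when the majority class is large.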
What is Keras ImageDataGenerator? The Keras ImageDataGenerator generates batches of tensor image data with real-time data augmentation, and it lets us loop over the data in batches.
With the current version of Keras, it's not possible to balance your dataset using only Keras built-in methods. flow_from_directory simply builds a list of all files and their classes, shuffles it (if needed), and then iterates over it.
But you could use a different trick: write your own generator that does the balancing in Python:
def balanced_flow_from_directory(flow_from_directory, options):
    # Wrap the iterator returned by flow_from_directory and rebalance every batch it yields.
    for x, y in flow_from_directory:
        yield custom_balance(x, y, options)
Here custom_balance should be a function that, given a batch (x, y), balances it and returns a balanced batch (x', y'). For most applications the batch size doesn't need to stay the same, but there are some unusual use cases (e.g. stateful RNNs) where batches must have a fixed size.
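The answer leaves custom_balance unspecified, so the following is only one possible sketch: it undersamples the majority class inside each batch. It assumes x and y are NumPy arrays, that y is a 1-D label vector (as produced with class_mode="binary"), and that options is a dict with an optional "rng" entry. The helper balanced_indices and the "rng" key are hypothetical names, not Keras API.

```python
import random

def balanced_indices(labels, rng):
    """Return shuffled indices holding an equal number of examples per class."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(int(label), []).append(i)
    # Every class is cut down to the size of the rarest class in this batch.
    smallest = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, smallest))
    rng.shuffle(keep)
    return keep

def custom_balance(x, y, options):
    """Undersample the majority class inside one batch (x, y).

    Assumption: x and y support indexing with a list of integers,
    as NumPy arrays do, and y holds one integer-like label per example.
    """
    rng = options.get("rng") or random.Random()
    keep = balanced_indices(list(y), rng)
    return x[keep], y[keep]

# Index selection on a hypothetical batch of labels (8 negatives, 2 positives).
labels = [0] * 8 + [1] * 2
keep = balanced_indices(labels, random.Random(0))  # 4 indices, 2 per class
```

With this approach the balanced batches are smaller than the original ones, so a larger batch_size in flow_from_directory may be needed. A batch that happens to contain only one class is returned unchanged, and the batch size varies, so this sketch does not suit the fixed-size cases mentioned above.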