 

Balancing an imbalanced dataset with the Keras image generator

Tags:

keras

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".

The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting generated dataset is balanced?
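For reference, a minimal sketch of the kind of augmentation pipeline that tutorial sets up (the directory path and parameter values below are just placeholders):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20,       # random rotations up to 20 degrees
                             width_shift_range=0.1,   # random horizontal shifts
                             height_shift_range=0.1,  # random vertical shifts
                             horizontal_flip=True,
                             rescale=1. / 255)

# Stream augmented batches from a directory of per-class sub-folders.
train_generator = datagen.flow_from_directory('data/train',   # hypothetical path
                                              target_size=(150, 150),
                                              batch_size=32,
                                              class_mode='binary')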

asked Jan 14 '17 by user1934212

People also ask

How do you balance an imbalanced image dataset?

One of the basic approaches to dealing with imbalanced datasets is data augmentation and re-sampling. There are two types of re-sampling: under-sampling, where we remove data from the majority class, and over-sampling, where we add repeated data to the minority class.
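As a rough illustration of both ideas on a plain label array (the labels below are made up), re-sampling can be done by drawing indices with or without replacement:

import numpy as np

y = np.array([0] * 900 + [1] * 100)   # hypothetical labels; class 1 is the minority
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Over-sampling: repeat minority indices (with replacement) until the classes match.
oversampled = np.concatenate([majority_idx,
                              np.random.choice(minority_idx, size=len(majority_idx), replace=True)])

# Under-sampling: keep only as many majority samples as there are minority samples.
undersampled = np.concatenate([np.random.choice(majority_idx, size=len(minority_idx), replace=False),
                               minority_idx])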

How does a CNN deal with imbalanced data?

An unbalanced dataset is a common issue in many areas and is not specific to computer vision or to the problems tackled by Convolutional Neural Networks (CNNs). To address it, you should try to balance your dataset, either by over-sampling the minority classes or by under-sampling the majority classes (or both).


2 Answers

This would not be a standard approach to dealing with unbalanced data, and I don't think it would really be justified: you would be significantly changing the distributions of your classes, leaving the smaller class much less variable. The larger class would keep its rich variation, while the smaller one would consist of many similar images differing only by small affine transforms. Its examples would occupy a much smaller region of image space than those of the majority class.

The more standard approaches would be:

  • the class_weight argument in model.fit, which you can use to make the model learn more from the minority class (a minimal sketch follows this list).
  • reducing the size of the majority class.
  • accepting the imbalance. Deep learning can cope with this; it just needs lots more data (the solution to everything, really).
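A minimal sketch of the first option, using dummy data and an arbitrary weight for the minority class (the model, data and weight values here are placeholders, not a recommendation):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Dummy stand-in for a real, imbalanced dataset: class 1 is the rare one.
x_train = np.random.rand(1000, 20)
y_train = np.array([0] * 900 + [1] * 100)

model = Sequential([Dense(16, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# class_weight makes errors on the minority class count roughly 9x as much.
model.fit(x_train, y_train,
          epochs=5,
          batch_size=32,
          class_weight={0: 1.0, 1: 9.0})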

The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).

The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).

If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
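One way to do that pre-processing step, sketched here with made-up paths, is to point an ImageDataGenerator at the minority-class images only and write a fixed number of augmented copies to disk via save_to_dir:

import os
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

minority_dir = 'data/train/minority_class'        # hypothetical input folder
out_dir = 'data/train_augmented/minority_class'   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for fname in os.listdir(minority_dir):
    img = img_to_array(load_img(os.path.join(minority_dir, fname)))
    img = img.reshape((1,) + img.shape)           # flow() expects a batch axis
    flow = datagen.flow(img, batch_size=1,
                        save_to_dir=out_dir,
                        save_prefix='aug',
                        save_format='jpeg')
    for _ in range(5):                            # 5 augmented copies per image
        next(flow)

The augmented folder can then be merged back with the original images so the classes end up roughly balanced.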

answered Sep 21 '22 by Luke_radio


You can use this strategy to calculate weights based on the imbalance:

from sklearn.utils import class_weight
import numpy as np

# One weight per class, inversely proportional to its frequency
# among the generator's labels.
class_weights = class_weight.compute_class_weight(
    'balanced',
    np.unique(train_generator.classes),
    train_generator.classes)

# Keras expects a {class_index: weight} dictionary.
train_class_weights = dict(enumerate(class_weights))

model.fit_generator(..., class_weight=train_class_weights)
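Note that in recent versions of scikit-learn the classes and y arguments of compute_class_weight are keyword-only, and Keras has deprecated fit_generator in favour of fit, so on newer versions an equivalent sketch (still assuming train_generator comes from flow_from_directory) would be:

import numpy as np
from sklearn.utils import class_weight

# 'balanced' weights each class inversely to its frequency in the labels.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)

train_class_weights = dict(enumerate(class_weights))
model.fit(train_generator, class_weight=train_class_weights)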

This answer was inspired by Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

answered Sep 19 '22 by Taísa Felix