The keras
ImageDataGenerator
can be used to "Generate batches of tensor image data with real-time data augmentation"
The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting, generated dataset is balanced?
One of the basic approaches to deal with the imbalanced datasets is to do data augmentation and re-sampling. There are two types of re-sampling such as under-sampling when we removing the data from the majority class and over-sampling when we adding repetitive data to the minority class.
Unbalanced dataset is a common issue in all areas and does not specifically concern computer vision and problems dealt by Convolutional Neural Networks (CNNs). To tackle this problem you should try to balance your dataset, either by over-sampling minority classes or under-sampling majority classes (or both).
This would not be a standard approach to deal with unbalanced data. Nor do I think it would be really justified - you would be significantly changing the distributions of your classes, where the smaller class is now much less variable. The larger class would have rich variation, the smaller would be many similar images with small affine transforms. They would live on a much smaller region in image space than the majority class.
The more standard approaches would be:
The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).
The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).
If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
You can use this strategy to calculate weights based on the imbalance:
from sklearn.utils import class_weight import numpy as np class_weights = class_weight.compute_class_weight( 'balanced', np.unique(train_generator.classes), train_generator.classes) train_class_weights = dict(enumerate(class_weights)) model.fit_generator(..., class_weight=train_class_weights)
This answer was inspire by Is it possible to automatically infer the class_weight from flow_from_directory in Keras?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With