 

Balancing an imbalanced dataset with the Keras image generator

Tags:

keras

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".

The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting generated dataset is balanced?
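For reference, a minimal sketch of the kind of augmentation pipeline that tutorial sets up (the directory path and parameter values below are just placeholders):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20,       # random rotations up to 20 degrees
                             width_shift_range=0.1,   # random horizontal shifts
                             height_shift_range=0.1,  # random vertical shifts
                             horizontal_flip=True,
                             rescale=1. / 255)

# Stream augmented batches from a directory of per-class sub-folders.
train_generator = datagen.flow_from_directory('data/train',   # hypothetical path
                                              target_size=(150, 150),
                                              batch_size=32,
                                              class_mode='binary')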

asked Jan 14 '17 by user1934212

People also ask

How do you balance an imbalanced image dataset?

One of the basic approaches to dealing with imbalanced datasets is data augmentation and re-sampling. There are two types of re-sampling: under-sampling, where we remove data from the majority class, and over-sampling, where we add repeated data to the minority class.
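As a rough illustration of both ideas on a plain label array (the labels below are made up), re-sampling can be done by drawing indices with or without replacement:

import numpy as np

y = np.array([0] * 900 + [1] * 100)   # hypothetical labels; class 1 is the minority
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Over-sampling: repeat minority indices (with replacement) until the classes match.
oversampled = np.concatenate([majority_idx,
                              np.random.choice(minority_idx, size=len(majority_idx), replace=True)])

# Under-sampling: keep only as many majority samples as there are minority samples.
undersampled = np.concatenate([np.random.choice(majority_idx, size=len(minority_idx), replace=False),
                               minority_idx])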

How does a CNN deal with imbalanced data?

An unbalanced dataset is a common issue in many areas and is not specific to computer vision or to the problems tackled by Convolutional Neural Networks (CNNs). To address it, you should try to balance your dataset, either by over-sampling the minority classes or by under-sampling the majority classes (or both).


2 Answers

This would not be a standard approach to dealing with unbalanced data, and I don't think it would really be justified: you would be significantly changing the distributions of your classes, leaving the smaller class much less variable. The larger class would keep its rich variation, while the smaller one would consist of many similar images differing only by small affine transforms. Its examples would occupy a much smaller region of image space than those of the majority class.

The more standard approaches would be:

  • the class_weight argument in model.fit, which you can use to make the model learn more from the minority class (a minimal sketch follows this list).
  • reducing the size of the majority class.
  • accepting the imbalance. Deep learning can cope with this; it just needs lots more data (the solution to everything, really).
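A minimal sketch of the first option, using dummy data and an arbitrary weight for the minority class (the model, data and weight values here are placeholders, not a recommendation):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Dummy stand-in for a real, imbalanced dataset: class 1 is the rare one.
x_train = np.random.rand(1000, 20)
y_train = np.array([0] * 900 + [1] * 100)

model = Sequential([Dense(16, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# class_weight makes errors on the minority class count roughly 9x as much.
model.fit(x_train, y_train,
          epochs=5,
          batch_size=32,
          class_weight={0: 1.0, 1: 9.0})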

The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).

The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).

If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
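One way to do that pre-processing step, sketched here with made-up paths, is to point an ImageDataGenerator at the minority-class images only and write a fixed number of augmented copies to disk via save_to_dir:

import os
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

minority_dir = 'data/train/minority_class'        # hypothetical input folder
out_dir = 'data/train_augmented/minority_class'   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for fname in os.listdir(minority_dir):
    img = img_to_array(load_img(os.path.join(minority_dir, fname)))
    img = img.reshape((1,) + img.shape)           # flow() expects a batch axis
    flow = datagen.flow(img, batch_size=1,
                        save_to_dir=out_dir,
                        save_prefix='aug',
                        save_format='jpeg')
    for _ in range(5):                            # 5 augmented copies per image
        next(flow)

The augmented folder can then be merged back with the original images so the classes end up roughly balanced.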

answered Sep 21 '22 by Luke_radio


You can use this strategy to calculate weights based on the imbalance:

from sklearn.utils import class_weight
import numpy as np

# One weight per class, inversely proportional to its frequency
# among the generator's labels.
class_weights = class_weight.compute_class_weight(
    'balanced',
    np.unique(train_generator.classes),
    train_generator.classes)

# Keras expects a {class_index: weight} dictionary.
train_class_weights = dict(enumerate(class_weights))

model.fit_generator(..., class_weight=train_class_weights)
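Note that in recent versions of scikit-learn the classes and y arguments of compute_class_weight are keyword-only, and Keras has deprecated fit_generator in favour of fit, so on newer versions an equivalent sketch (still assuming train_generator comes from flow_from_directory) would be:

import numpy as np
from sklearn.utils import class_weight

# 'balanced' weights each class inversely to its frequency in the labels.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)

train_class_weights = dict(enumerate(class_weights))
model.fit(train_generator, class_weight=train_class_weights)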

This answer was inspired by Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

answered Sep 19 '22 by Taísa Felix