Keras image preprocessing with unbalanced data

All,

I'm trying to use Keras for image classification with two classes. For one class I have a very limited number of images, say 500. For the other class I have an almost unlimited number of images. How can I handle this with Keras image preprocessing?

Ideally, I need something like this: for class one, I feed in the 500 images and use ImageDataGenerator to generate more; for class two, I draw 500 images at a time, in sequence, from a dataset of 1,000,000 images, probably with no data augmentation needed.

Looking at the example here and at the Keras documentation, it seems the training folder is expected to contain an equal number of images for each class by default. So my question is: is there an existing API for this? If so, please point me to it. If not, is there a workaround?

Jane asked Jun 21 '17 04:06



1 Answer

You have some options.

Option 1

Use the class_weight parameter of the fit() function, which is a dictionary mapping class indices to weight values. Let's say you have 500 samples of class 0 and 1500 samples of class 1; then you pass class_weight = {0: 3, 1: 1}. That gives class 0 three times the weight of class 1.

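train_generator.classes gives you the integer class index of every sample, and train_generator.class_indices maps the class (folder) names to those indices, so you can assign each weight to the right class.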

If you want to calculate this programmatically, you could use scikit-learn's sklearn.utils.compute_class_weight(): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py

The function looks at the distribution of labels and produces weights that equally penalize under- or over-represented classes in the training set.
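
As a rough illustration of how such weights can be derived, here is a minimal sketch that reimplements the "balanced" heuristic (n_samples / (n_classes * class_count)) in plain Python. The helper name balanced_class_weights and the label list are made up for this example; with a real generator you would pass train_generator.classes instead.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_index: weight} using n_samples / (n_classes * class_count).

    Hypothetical helper for illustration; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Made-up label list matching the example above: 500 x class 0, 1500 x class 1.
labels = [0] * 500 + [1] * 1500
class_weight = balanced_class_weights(labels)
print(class_weight)
```

The absolute values differ from {0: 3, 1: 1}, but the ratio between the two classes is the same 3:1. The resulting dictionary can then be passed as the class_weight argument when training.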

See also this useful thread here: https://github.com/fchollet/keras/issues/1875

This thread might also be of help: Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

Option 2

Do a dummy training run with a generator that applies your image augmentation (rotation, scaling, cropping, flipping, etc.) and save the augmented images for the real training later. That way you can create a bigger, or even balanced, dataset for your underrepresented class.

In this dummy run, set save_to_dir in the flow_from_directory function to a folder of your choosing, and afterwards keep only the images from the class that you need more samples of. Discard any training results, since this run only serves to produce more data.
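
A minimal sketch of this idea, under a few assumptions: instead of a full dummy training run it simply iterates the generator (a simplification, since the images are written whenever a batch is produced), it restricts the generator to the minority class via the classes argument, and it assumes a Keras/TensorFlow version that still ships ImageDataGenerator. The function names, paths, image size and augmentation settings are all made up for illustration.

```python
import itertools
import os


def draw_augmented(batches, n_target):
    """Pull batches from an iterator until at least n_target samples were produced.

    Returns the number of samples actually produced.
    """
    produced = 0
    for batch in batches:
        produced += len(batch)
        if produced >= n_target:
            break
    return produced


def minority_generator(data_dir, class_name, out_dir, batch_size=32):
    """Build a generator that augments one class and saves every image to out_dir."""
    # Imported lazily so draw_augmented can be tried without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    os.makedirs(out_dir, exist_ok=True)
    datagen = ImageDataGenerator(
        rotation_range=20,      # illustrative augmentation settings
        zoom_range=0.2,
        horizontal_flip=True,
    )
    return datagen.flow_from_directory(
        data_dir,
        classes=[class_name],   # only read the underrepresented class
        class_mode=None,        # yield image batches without labels
        target_size=(224, 224),
        batch_size=batch_size,
        save_to_dir=out_dir,    # each generated image is written here
        save_prefix="aug",
        save_format="png",
    )


# Stand-in for a Keras generator: an endless stream of 32-sample batches.
fake_batches = itertools.cycle([[0] * 32])
print(draw_augmented(fake_batches, 1000))

# With real data (hypothetical paths), roughly 1000 extra images could be made via:
# draw_augmented(minority_generator("data/train", "class_one", "data/augmented"), 1000)
```

The Keras generator loops forever, which is why the loop stops on a sample count rather than waiting for the iterator to end.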

petezurich answered Oct 03 '22 08:10