Fit Image augmentations to training data using flow_from_directory

I want to use Image augmentation in Keras. My current code looks like this:

from keras.preprocessing.image import ImageDataGenerator

# define image augmentations
train_datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    zca_whitening=True)

# generate image batches from directory
train_datagen.flow_from_directory(train_dir)

When I run a model with this, I get the following error:

"ImageDataGenerator specifies `featurewise_std_normalization`, but it hasn't been fit on any training data."

But I couldn't find clear information on how to use train_datagen.fit() together with flow_from_directory.

asked Oct 12 '17 by Mario Kreutzfeldt
1 Answer

You are right, the docs are not very enlightening on this ...

What you need is actually a 4-step process:

  1. Define your data augmentation
  2. Fit the augmentation
  3. Set up your generator using flow_from_directory()
  4. Train your model with fit_generator()

Here is the necessary code for a hypothetical image classification case:

# define data augmentation configuration
train_datagen = ImageDataGenerator(featurewise_center=True,
                                   featurewise_std_normalization=True,
                                   zca_whitening=True)

# fit the data augmentation (x_train must be an array of images already loaded in memory)
train_datagen.fit(x_train)

# setup generator
train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='categorical')

# train model
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,  # number of batches per epoch
    epochs=epochs,
    validation_data=validation_generator,  # optional - if used, needs to be defined
    validation_steps=nb_validation_samples // batch_size)

Clearly, there are several parameters to be defined (train_data_dir, nb_train_samples, etc.), but hopefully you get the idea.

If you also need to use a validation_generator, as in my example, it should be defined the same way as your train_generator; a sketch is shown below.
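
For instance, here is a minimal sketch, assuming a hypothetical validation_data_dir and reusing the image dimensions and batch size defined above; note that the featurewise statistics are still fitted on the training data x_train, so that validation images are normalized the same way:

# same preprocessing configuration as for training
validation_datagen = ImageDataGenerator(featurewise_center=True,
                                        featurewise_std_normalization=True,
                                        zca_whitening=True)

# use the training statistics for normalization
validation_datagen.fit(x_train)

# setup validation generator (validation_data_dir is a hypothetical path)
validation_generator = validation_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='categorical')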

UPDATE (after comment)

Step 2 needs some discussion; here, x_train is the actual training data, which, ideally, should fit into main memory. Also, according to the documentation, this step is

Only required if featurewise_center or featurewise_std_normalization or zca_whitening.
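
Conversely, if you only use augmentations that don't need dataset-wide statistics, you can skip fit() altogether; here is a minimal sketch (the specific parameter values are just an illustration):

# no featurewise statistics requested, so no fit() call is needed
simple_datagen = ImageDataGenerator(rescale=1./255,
                                    rotation_range=20,
                                    horizontal_flip=True)

simple_generator = simple_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='categorical')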

However, there are many real-world cases where the requirement that all the training data fit into memory is clearly unrealistic. How you center/normalize/whiten data in such cases is a (huge) sub-field in itself, and arguably the main reason for the existence of big data processing frameworks such as Spark.

So, what to do in practice here? Well, the next logical action in such a case is to sample your data; indeed, this is exactly what the community advises - here is Keras creator Francois Chollet on Working with large datasets like Imagenet:

datagen.fit(X_sample) # let's say X_sample is a small-ish but statistically representative sample of your data

And another quote from an ongoing open discussion about extending ImageDataGenerator (emphasis added):

fit is required for feature-wise standardization and ZCA, and it only takes an array as a parameter; there is no fit for a directory. For now, we need to manually read a subset of the images to do this fit for a directory. One idea is that we could change fit() to accept the generator itself (flow_from_directory); of course, standardization should be disabled during fit.
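
To make that workaround concrete, here is a minimal sketch, where load_image_sample is a hypothetical helper that loads a random subset of the images under train_data_dir into an array before calling fit():

import os
import random
import numpy as np
from keras.preprocessing.image import load_img, img_to_array

def load_image_sample(directory, sample_size, target_size):
    # collect all image paths from the class sub-folders
    paths = [os.path.join(root, f)
             for root, _, files in os.walk(directory)
             for f in files if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    # load a random, hopefully representative, subset into memory
    sample_paths = random.sample(paths, min(sample_size, len(paths)))
    images = [img_to_array(load_img(p, target_size=target_size)) for p in sample_paths]
    return np.stack(images)

# fit the featurewise statistics on the sample only
x_sample = load_image_sample(train_data_dir, sample_size=1000,
                             target_size=(img_height, img_width))
train_datagen.fit(x_sample)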

answered Sep 22 '22 by desertnaut