I want to use Image augmentation in Keras. My current code looks like this:
# define image augmentations
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    zca_whitening=True)

# generate image batches from directory
train_datagen.flow_from_directory(train_dir)
When I run a model with this, I get the following error:
"ImageDataGenerator specifies `featurewise_std_normalization`, but it hasn't been fit on any training data."
But I couldn't find clear information about how to use train_datagen.fit() together with flow_from_directory.
You are right, the docs are not very enlightening on this ...
What you need is actually a 4-step process:
1. define an ImageDataGenerator with the desired transformations
2. fit() it on (a sample of) your data
3. set up the batch generator with flow_from_directory()
4. train the model with fit_generator()
Here is the necessary code for a hypothetical image classification case:
# define data augmentation configuration
train_datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    zca_whitening=True)

# fit the data augmentation
train_datagen.fit(x_train)

# set up generator
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')

# train model
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,  # number of batches per epoch
    epochs=epochs,
    validation_data=validation_generator,  # optional - if used, it needs to be defined
    validation_steps=nb_validation_samples // batch_size)
Clearly, there are several parameters to be defined (train_data_dir, nb_train_samples etc.), but hopefully you get the idea. If you also need to use a validation_generator, as in my example, it should be defined the same way as your train_generator.
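For illustration, here is a minimal self-contained sketch of such a validation_generator; the temporary directory of random images, the class names, and the image size are all hypothetical stand-ins for your real validation data:

```python
import os
import tempfile

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical stand-in for a real validation directory: a tiny tree
# with one random image per class, so the snippet is self-contained.
# In practice, point validation_data_dir at your actual data.
validation_data_dir = tempfile.mkdtemp()
for cls in ('cats', 'dogs'):
    os.makedirs(os.path.join(validation_data_dir, cls))
    arr = np.uint8(np.random.rand(32, 32, 3) * 255)
    Image.fromarray(arr).save(os.path.join(validation_data_dir, cls, 'img0.png'))

# validation data normally gets no augmentation, only the same preprocessing
validation_datagen = ImageDataGenerator()
validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(32, 32),
    batch_size=2,
    class_mode='categorical')

x_batch, y_batch = next(validation_generator)
print(x_batch.shape, y_batch.shape)  # (2, 32, 32, 3) (2, 2)
```

Note that the validation generator usually applies only the same rescaling/standardization as the training one, without the random augmentations.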
UPDATE (after comment)
Step 2 needs some discussion; here, x_train are the actual data which, ideally, should fit into the main memory. Also, according to the documentation, this step is
Only required if featurewise_center or featurewise_std_normalization or zca_whitening.
However, there are many real-world cases where the requirement that all the training data fit into memory is clearly unrealistic. How you center/normalize/white data in such cases is a (huge) sub-field in itself, and arguably the main reason for the existence of big data processing frameworks such as Spark.
So, what to do in practice here? Well, the next logical action in such a case is to sample your data; indeed, this is exactly what the community advises - here is Keras creator Francois Chollet on Working with large datasets like Imagenet:
datagen.fit(X_sample) # let's say X_sample is a small-ish but statistically representative sample of your data
And another quote from an ongoing open discussion about extending ImageDataGenerator (emphasis added):

fit is required for feature-wise standardization and ZCA, and it only takes an array as parameter; there is no fit for directory. For now, we need to manually read a subset of the images to do this fit for a directory. One idea is we can change fit() to accept the generator itself (flow_from_directory); of course, standardization should be disabled during fit.
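The manual workaround described in that quote can be sketched as follows; the temporary tree of random PNGs stands in for a real train_data_dir, and all names are hypothetical:

```python
import os
import tempfile

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical directory of images; a temporary tree of random PNGs
# stands in for your real train_data_dir.
data_dir = tempfile.mkdtemp()
for cls in ('cats', 'dogs'):
    os.makedirs(os.path.join(data_dir, cls))
    for i in range(4):
        arr = np.uint8(np.random.rand(16, 16, 3) * 255)
        Image.fromarray(arr).save(os.path.join(data_dir, cls, 'img%d.png' % i))

# a *plain* generator (no standardization) just to read images from disk
reader = ImageDataGenerator().flow_from_directory(
    data_dir,
    target_size=(16, 16),
    batch_size=4,
    class_mode=None)  # yields only images, no labels

# stack a few batches into an in-memory fitting sample
sample = np.concatenate([next(reader) for _ in range(2)])

# now fit the real, standardizing generator on that sample
train_datagen = ImageDataGenerator(featurewise_center=True,
                                   featurewise_std_normalization=True)
train_datagen.fit(sample)
```

After this, train_datagen can be used with flow_from_directory as shown earlier, with its feature-wise statistics already computed.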