Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Handle invalid/corrupted image files in ImageDataGenerator.flow_from_directory in Keras

I am using Python with Keras and running ImageDataGenerator and using flow_from_directory. I have some problematic image files, so can I use the data generator in order to handle the read errors?

I am getting some "not valid jpg file" on a small portion of the images and would like to treat this without my code crashing.

like image 462
thebeancounter Avatar asked Sep 20 '18 21:09

thebeancounter


1 Answers

Well, one solution is to modify the ImageDataGenerator code and put error handling mechanism (i.e. try/except) in it.

However, one alternative is to wrap your generator inside another generator and use try/except there. The disadvantage of this solution is that it throws away the whole generated batch even if one single image is corrupted in that batch (this may mean that it is possible that some of the samples may not be used for training at all):

data_gen = ImageDataGenerator(...)

train_gen = data_gen.flow_from_directory(...)

def my_gen(gen):
    while True:
        try:
            data, labels = next(gen)
            yield data, labels
        except:
            pass

# ... define your model and compile it

# fit the model
model.fit_generator(my_gen(train_gen), ...)

Another disadvantage of this solution is that since you need to specify the number of steps of generator (i.e. steps_per_epoch) and considering that a batch may be thrown away in a step and a new batch is fetched instead in the same step, you may end up training on some of the samples more than once in an epoch. This may or may not have significant effects depending on how many batches include corrupted images (i.e. if there are a few, then there is nothing to be worried about that much).

Finally, note that you may want to use the newer Keras data-generator i.e. Sequence class to read images one by one in the __getitem__ method in each batch and discard corrupted ones. However, the problem of the previous approach, i.e. training on some of the images more than once, is still present in this approach as well since you also need to implement the __len__ method and it is essentially equivalent to steps_per_epoch argument. Although, in my opinion, this approach (i.e. subclassing Sequence class) is superior to the above approach (of course, if you put aside the fact that you may need to write more code) and have fewer side effects (since you can discard a single image and not the whole batch).

like image 194
today Avatar answered Oct 16 '22 06:10

today