 

Use multiple directories for flow_from_directory in Keras

My scenario is that we have multiple peers, each with its own data, stored in different directories that share the same sub-directory structure. I want to train the model on all of this data, but if I copy everything into one folder, I can no longer keep track of which data belongs to whom (new data is also created occasionally, so copying the files over every time is not practical). My data is currently stored like this:

-user01
-user02
-user03
...

(all of them have similar sub-directory structure)

I have searched for a solution, but I only found the multi-input case here and here, where multiple inputs are concatenated into a single parallel input, which is not my case.

I know that flow_from_directory() can only be fed one directory at a time, so how can I build a custom generator that can be fed multiple directories at once?

If my question is low-quality, please advise on how to improve it. I have also searched the Keras GitHub repository but didn't find anything I could adapt.

Thank you.

asked Jul 16 '18 by Thanh Nguyen


2 Answers

The Keras ImageDataGenerator flow_from_directory method has a follow_links parameter.

Maybe you can create one directory which is populated with symlinks to files in all the other directories.

This stack question discusses using symlinks with Keras ImageDataGenerator: Understanding 'follow_links' argument in Keras's ImageDataGenerator?
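The symlink idea can be sketched as follows. This is a minimal, hypothetical helper (the function name, the peer directories, and the `merged` target directory are all assumptions, not part of the original answer): it builds one merged class tree of symlinks, prefixing each link with the peer's name so you can still tell whose file each one is.

```python
import os


def build_symlink_tree(source_roots, merged_root):
    """Populate merged_root with symlinks to every file in the peers'
    class sub-directories. Each link is prefixed with the peer's name,
    so the origin of every file stays visible in the merged tree."""
    for root in source_roots:
        peer = os.path.basename(os.path.normpath(root))
        for class_name in sorted(os.listdir(root)):
            class_src = os.path.join(root, class_name)
            if not os.path.isdir(class_src):
                continue
            class_dst = os.path.join(merged_root, class_name)
            os.makedirs(class_dst, exist_ok=True)
            for fname in sorted(os.listdir(class_src)):
                link = os.path.join(class_dst, peer + "_" + fname)
                if not os.path.lexists(link):
                    target = os.path.abspath(os.path.join(class_src, fname))
                    os.symlink(target, link)
```

You would then point `flow_from_directory(merged_root, ..., follow_links=True)` at the merged tree; re-running the helper after new data arrives only adds the missing links.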

answered Sep 28 '22 by user3731622


After so many days I hope you have found a solution to the problem, but I will share another idea here so that people who face the same problem in the future can get help.

A few days ago I had this kind of problem. follow_links would be a solution to your question, as user3731622 said. I also think the idea of merging two data generators will work. However, in that case, the batch size of each sub-generator has to be set in proportion to the number of images in its directory.

Batch size of a sub-generator: b = (B × n) / Σn

where
b = batch size of the sub-generator,
B = desired batch size of the merged generator,
n = number of images in that sub-generator's directory, and
Σn = total number of images across all directories.
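As a worked example of this split (the helper function name here is mine, not from the answer): with two directories of 600 and 200 images and a desired merged batch size of 32, the shares are 600 × 32 / 800 = 24 and 200 × 32 / 800 = 8. Because integer division can lose a few samples, the last share can absorb the rounding remainder so the sub-batches always sum to B:

```python
def sub_batch_sizes(counts, batch_size):
    """Split a desired merged batch size proportionally to the number
    of images per directory; the last share absorbs rounding leftovers
    so the sub-batches always sum to batch_size."""
    total = sum(counts)
    sizes = [(batch_size * n) // total for n in counts[:-1]]
    sizes.append(batch_size - sum(sizes))
    return sizes


print(sub_batch_sizes([600, 200], 32))  # [24, 8]
```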

See the code below, this may help:

from keras.preprocessing.image import ImageDataGenerator
from keras.utils import Sequence
import matplotlib.pyplot as plt
import numpy as np
import os


class MergedGenerators(Sequence):
    """Wraps several flow_from_directory generators and yields one
    merged batch per step."""

    # Avoid mutable default arguments; default to None and fall back
    # to empty lists inside the constructor.
    def __init__(self, batch_size, generators=None, sub_batch_size=None):
        self.generators = generators if generators is not None else []
        self.sub_batch_size = sub_batch_size if sub_batch_size is not None else []
        self.batch_size = batch_size

    def __len__(self):
        return int(
            sum([(len(self.generators[idx]) * self.sub_batch_size[idx])
                 for idx in range(len(self.sub_batch_size))]) /
            self.batch_size)

    def __getitem__(self, index):
        """Getting items from the generators and packing them"""

        X_batch = []
        Y_batch = []
        for generator in self.generators:
            if generator.class_mode is None:
                x1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]

            else:
                x1, y1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
                Y_batch = [*Y_batch, *y1]

        if self.generators[0].class_mode is None:
            return np.array(X_batch)
        return np.array(X_batch), np.array(Y_batch)


def build_datagenerator(dir1=None, dir2=None, batch_size=32):
    n_images_in_dir1 = sum([len(files) for r, d, files in os.walk(dir1)])
    n_images_in_dir2 = sum([len(files) for r, d, files in os.walk(dir2)])

    # The two generators need different batch sizes, since the two
    # directories hold different numbers of images; sizing each batch
    # proportionally equalizes every directory's share in a merged batch.
    generator1_batch_size = int((n_images_in_dir1 * batch_size) /
                                (n_images_in_dir1 + n_images_in_dir2))

    generator2_batch_size = batch_size - generator1_batch_size

    generator1 = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.2,
        zoom_range=0.2,
        rotation_range=5.,
        horizontal_flip=True,
    )

    generator2 = ImageDataGenerator(
        rescale=1. / 255,
        zoom_range=0.2,
        horizontal_flip=False,
    )

    # generator2 has different image augmentation attributes than generator1
    generator1 = generator1.flow_from_directory(
        dir1,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator1_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )

    generator2 = generator2.flow_from_directory(
        dir2,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator2_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )

    return MergedGenerators(
        batch_size,
        generators=[generator1, generator2],
        sub_batch_size=[generator1_batch_size, generator2_batch_size])


def test_datagen(batch_size=32):
    datagen = build_datagenerator(dir1="./asdf",
                                  dir2="./asdf2",
                                  batch_size=batch_size)

    print("Datagenerator length (Batch count):", len(datagen))

    for batch_count, image_batch in enumerate(datagen):
        if batch_count == 1:
            break

        print("Images: ", image_batch.shape)

        plt.figure(figsize=(10, 10))
        for i in range(image_batch.shape[0]):
            plt.subplot(1, batch_size, i + 1)
            plt.imshow(image_batch[i], interpolation='nearest')
            plt.axis('off')
        # Lay out and show the figure once, after all subplots are drawn
        plt.tight_layout()
        plt.show()


test_datagen(4)

answered Sep 28 '22 by Arafat Hasan