
How does shuffling work with ImageDataGenerator in Machine Learning?

I'm creating an image classification model with Inception V3 and have two classes. I've split my dataset and labels into two numpy arrays. The data is split with trainX and testX as the images and trainY and testY as the corresponding labels.

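import numpy as np
import keras  # or: from tensorflow import keras
from sklearn.model_selection import train_test_split

data = np.array(data, dtype="float") / 255.0
labels = np.array(labels, dtype="uint8")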

(trainX, testX, trainY, testY) = train_test_split(
    data, labels,
    test_size=0.2,
    random_state=42)

train_datagen = keras.preprocessing.image.ImageDataGenerator(
    zoom_range=0.1,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

val_datagen = keras.preprocessing.image.ImageDataGenerator()

train_generator = train_datagen.flow(
    trainX,
    trainY,
    batch_size=batch_size,
    shuffle=True)

validation_generator = val_datagen.flow(
    testX,
    testY,
    batch_size=batch_size)

When I shuffle train_generator with ImageDataGenerator, will the images still match the corresponding labels? Also, should the validation dataset be shuffled as well?

asked by student17, Aug 22 '18 14:08


2 Answers

Yes, the images will still match their labels, so you can safely set shuffle to True. Under the hood it works as follows: calling .flow() on the ImageDataGenerator returns a NumpyArrayIterator object, which implements the following logic for shuffling the indices:

def _set_index_array(self):
    self.index_array = np.arange(self.n)
    if self.shuffle: # if shuffle==True, shuffle the indices
        self.index_array = np.random.permutation(self.n) 

self.index_array is then used to yield both the images (x) and the labels (y) (code truncated for readability):

def _get_batches_of_transformed_samples(self, index_array):
    batch_x = np.zeros(tuple([len(index_array)] + list(self.x.shape)[1:]),
                       dtype=self.dtype)
    # use index_array to get the x's
    for i, j in enumerate(index_array):
        x = self.x[j]
        ... # data augmentation is done here
        batch_x[i] = x
    ...
    # use the same index_array to fetch the labels
    output += (self.y[index_array],)

    return output

Check out the source code yourself; it might be easier to understand than you think.
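
If you would rather convince yourself without reading Keras, the shared-index idea can be reproduced in a few lines. This is only a minimal sketch in plain Python: PairedBatchIterator is a hypothetical stand-in for NumpyArrayIterator, lists replace the numpy arrays, and there is no augmentation. Each fake "image" is a string that carries its own label, so a mismatch would be easy to spot.

```python
import random


class PairedBatchIterator:
    """Toy stand-in for Keras' NumpyArrayIterator (hypothetical, illustration only)."""

    def __init__(self, x, y, batch_size, shuffle=True, seed=None):
        assert len(x) == len(y), "x and y must have the same number of samples"
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        # One epoch: build the index list and optionally shuffle it once.
        index_array = list(range(len(self.x)))
        if self.shuffle:
            self.rng.shuffle(index_array)
        for start in range(0, len(index_array), self.batch_size):
            batch_idx = index_array[start:start + self.batch_size]
            # The same indices select both the images and the labels.
            batch_x = [self.x[j] for j in batch_idx]
            batch_y = [self.y[j] for j in batch_idx]
            yield batch_x, batch_y


# Fake "images": each string records the label it belongs to.
images = [f"img_{i}_label_{i % 2}" for i in range(10)]
labels = [i % 2 for i in range(10)]

pairs_ok = all(
    img.endswith(f"label_{lab}")
    for batch_x, batch_y in PairedBatchIterator(images, labels, batch_size=4, seed=0)
    for img, lab in zip(batch_x, batch_y)
)
print(pairs_ok)  # True: every image is still paired with its own label
```

Only the index list is permuted; x and y themselves are never reordered, which is why the pairs cannot drift apart.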

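Shuffling the validation data shouldn't matter: validation metrics are averaged over the whole set, so the order of the samples does not change them. The main point of shuffling is to introduce some extra stochasticity in the training process. Note that flow() shuffles by default, so pass shuffle=False to the validation generator if you later need its predictions in the same order as testY.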
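
To make the point about validation order concrete, here is a small sketch with made-up labels and predictions (y_true and y_pred are hypothetical values, not output from any model). It shows that an aggregate metric such as accuracy is unchanged when labels and predictions are shuffled together, whereas comparing shuffled predictions against labels in their original order breaks the pairing.

```python
import random

# Hypothetical validation labels and fixed model predictions, in dataset order.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]


def accuracy(truth, pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


# Shuffle labels and predictions together, as a shuffled generator would.
order = list(range(len(y_true)))
random.Random(0).shuffle(order)
shuffled_true = [y_true[i] for i in order]
shuffled_pred = [y_pred[i] for i in order]

acc_ordered = accuracy(y_true, y_pred)
acc_shuffled = accuracy(shuffled_true, shuffled_pred)  # same pairs, new order
acc_misaligned = accuracy(y_true, shuffled_pred)       # pairing is broken

print(acc_ordered == acc_shuffled)  # True: order does not affect the metric
print(acc_misaligned)               # not meaningful: labels and predictions no longer correspond
```

In Keras terms, the misaligned case is what happens if you predict from a shuffled validation_generator and then compare the result against testY directly; passing shuffle=False to val_datagen.flow(...) avoids it.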

answered by sdcbr, Oct 24 '22 08:10


shuffle is True by default, so if you do not want the data shuffled you must set it to False explicitly:

train_generator = train_datagen.flow(
        trainX, 
        trainY,
        batch_size=batch_size,
        shuffle=False)

answered by Eliza, Oct 24 '22 10:10