How does shuffling work with ImageDataGenerator in Machine Learning?

I'm creating an image classification model with Inception V3 and have two classes. I've put my dataset and labels into two NumPy arrays and split them, with trainX and testX as the images and trainY and testY as the corresponding labels.

data = np.array(data, dtype="float")/255.0
labels = np.array(labels,dtype ="uint8")

(trainX, testX, trainY, testY) = train_test_split(
                                data,labels, 
                                test_size=0.2, 
                                random_state=42) 

train_datagen = keras.preprocessing.image.ImageDataGenerator(
          zoom_range = 0.1,
          width_shift_range = 0.2, 
          height_shift_range = 0.2,
          horizontal_flip = True,
          fill_mode ='nearest') 

val_datagen = keras.preprocessing.image.ImageDataGenerator()


train_generator = train_datagen.flow(
        trainX, 
        trainY,
        batch_size=batch_size,
        shuffle=True)

validation_generator = val_datagen.flow(
                testX,
                testY,
                batch_size=batch_size)

When I shuffle train_generator with ImageDataGenerator, will the images still match their corresponding labels? Also, should the validation dataset be shuffled as well?

asked Aug 22 '18 14:08

student17

2 Answers

Yes, the images will still match the corresponding labels, so you can safely set shuffle to True. Under the hood it works as follows. Calling .flow() on the ImageDataGenerator returns a NumpyArrayIterator object, which shuffles the indices (not the data itself) with the following logic:

def _set_index_array(self):
    self.index_array = np.arange(self.n)
    if self.shuffle: # if shuffle==True, shuffle the indices
        self.index_array = np.random.permutation(self.n)

self.index_array is then used to yield both the images (x) and the labels (y) (code truncated for readability):

def _get_batches_of_transformed_samples(self, index_array):
    batch_x = np.zeros(tuple([len(index_array)] + list(self.x.shape)[1:]),
                       dtype=self.dtype)
    # use index_array to get the x's
    for i, j in enumerate(index_array):
        x = self.x[j]
        ... # data augmentation is done here
        batch_x[i] = x
    ...
    # use the same index_array to fetch the labels
    output += (self.y[index_array],)

    return output
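The same idea can be checked without Keras. The following is a minimal NumPy-only sketch, not the library's code: the toy arrays `x` and `y`, the seed, and the batching loop are all made up for illustration. It permutes an index array once and uses it to slice both arrays, which is why each image keeps its own label.

```python
import numpy as np

# Toy stand-ins for trainX / trainY (hypothetical data): "image" i is filled
# with the value i and its label is also i, so any mismatch is easy to spot.
n = 10
x = np.stack([np.full((2, 2), i, dtype="float32") for i in range(n)])
y = np.arange(n)

# Same idea as _set_index_array with shuffle=True: permute indices, not data.
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
index_array = rng.permutation(n)

batch_size = 4
batches = []
for start in range(0, n, batch_size):
    batch_idx = index_array[start:start + batch_size]
    # Both arrays are indexed with the same batch_idx, so pairs stay together.
    batches.append((x[batch_idx], y[batch_idx]))

# Every image in every batch still carries the value of its own label.
pairs_match = all(
    np.array_equal(bx[:, 0, 0], by.astype("float32")) for bx, by in batches
)
print(pairs_match)
```

Because only the indices are shuffled, the check holds for any seed; the order of samples changes, but the pairing does not.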

Check out the source code yourself; it might be easier to understand than you think.

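Shuffling the validation data shouldn't matter much: metrics such as loss and accuracy are aggregated over the whole validation set, so the order in which the batches arrive does not change them. The main point of shuffling is to introduce some extra stochasticity into the training process.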
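To illustrate why the order of validation samples does not affect an aggregate metric, here is a small NumPy-only sketch. The labels and the "model" predictions are fabricated purely for the demonstration and are not taken from the question. It also shows the one case where order does matter: comparing predictions made in shuffled order against labels kept in their original order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth labels for 20 validation samples (two classes).
test_y = rng.integers(0, 2, size=20)

# Hypothetical predictions: correct everywhere except every 5th sample.
preds = test_y.copy()
preds[::5] = 1 - preds[::5]

# Accuracy over the full set is identical whatever order the samples come in,
# as long as predictions and labels are reordered together.
perm = rng.permutation(len(test_y))
acc_ordered = np.mean(preds == test_y)
acc_shuffled = np.mean(preds[perm] == test_y[perm])

# Shuffled predictions compared with labels in the original order are
# misaligned, so this number is no longer a meaningful accuracy.
acc_misaligned = np.mean(preds[perm] == test_y)

print(acc_ordered, acc_shuffled, acc_misaligned)
```

In practice this means that shuffling is harmless when a framework computes the metric for you, but an unshuffled generator is the safer choice when you plan to line predictions up against testY yourself.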

answered Oct 24 '22 08:10

sdcbr

shuffle is True by default, so if you do not want shuffling you must pass shuffle=False explicitly:

train_generator = train_datagen.flow(
        trainX, 
        trainY,
        batch_size=batch_size,
        shuffle=False)

answered Oct 24 '22 10:10

Eliza

Tags:

python

machine-learning

tensorflow

computer-vision

keras
