I have a custom file containing the paths to all my images and their labels, which I load into a dataframe using:
MyIndex=pd.read_table('./MySet.txt')
MyIndex has two columns of interest: ImageName and ClassName.
Next I load the images, do a train/test split, and encode the output labels:
images=[]
for index, row in MyIndex.iterrows():
    img_path=basePath+row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_path=None
    img_data = image.img_to_array(img)
    img=None
    images.append(img_data)
    img_data=None
images[0].shape
Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()
labels=MyIndex['ClassName']
images=np.array(images, dtype="float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(images,labels, test_size=0.10, random_state=42)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size=0.10, random_state=41)
images=None
labels=None
encoder = LabelEncoder()
encoder=encoder.fit(OutputClasses)
encoded_Y = encoder.transform(trainY)
# convert integers to dummy variables (i.e. one hot encoded)
trainY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(valY)
# convert integers to dummy variables (i.e. one hot encoded)
valY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(testY)
# convert integers to dummy variables (i.e. one hot encoded)
testY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
datagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25)
datagen.fit(trainX,augment=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
batch_size=128
model.fit_generator(datagen.flow(trainX,trainY,batch_size=batch_size), epochs=500,
steps_per_epoch=trainX.shape[0]//batch_size,validation_data=(valX,valY))
The problem is that the data, when loaded in one go, is too large to fit in the machine's memory, so I cannot work with the complete dataset.
I have tried to work with the data generator, but I do not want to follow the directory conventions it requires, and I also do not want to give up the augmentation part.
Is there a way to load batches from disk while satisfying these two conditions?
I believe you should have a look at this post
What you are looking for is Keras flow_from_dataframe, which lets you load batches from disk by providing the file names and their labels in a dataframe, along with a top-level directory path that contains all your images.
Making a few modifications to your code and borrowing some from the linked post:
MyIndex=pd.read_table('./MySet.txt')
Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()
trainDf=MyIndex[['ImageName','ClassName']]
train, test = train_test_split(trainDf, test_size=0.10, random_state=1)
#creating a data generator to load the files on runtime
traindatagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25,
validation_split=0.1)
train_generator=traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,#the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='training'
)
#Also a generator for the validation data
val_generator=traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,#the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='validation'
)
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=val_generator.n//val_generator.batch_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=train_generator, steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=val_generator,
validation_steps=STEP_SIZE_VALID,
epochs=500)
Also note that you no longer need the label encoding from your original code (class_mode='categorical' produces one-hot labels), and you can omit the image loading code.
I have not run this code myself, so you may need to fix a few bugs; the primary goal was to convey the basic idea.
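One detail in the step-size lines above may be worth checking: floor division drops the last partial batch, and it evaluates to 0 when a subset is smaller than batch_size. Below is a minimal sketch of the arithmetic, assuming the generator yields a smaller final batch; the helper name steps_per_epoch is hypothetical and not part of Keras.

```python
import math

def steps_per_epoch(n_samples, batch_size, keep_partial_batch=True):
    """Number of generator batches needed to cover n_samples once.

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    if n_samples <= 0 or batch_size <= 0:
        raise ValueError("n_samples and batch_size must be positive")
    if keep_partial_batch:
        # Round up so the final, smaller batch is counted too.
        return math.ceil(n_samples / batch_size)
    # Round down, which skips the final partial batch (can be 0).
    return n_samples // batch_size

print(steps_per_epoch(1000, 128, keep_partial_batch=False))  # 7
print(steps_per_epoch(1000, 128))                            # 8
print(steps_per_epoch(90, 128, keep_partial_batch=False))    # 0
```

If the validation subset is small, the rounded-up value avoids passing validation_steps=0.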
In response to your comment: if your files are in different directories, one solution is to store in the ImageName column the relative path including the intermediate directory, something like './Dir/File.jpg', then move all the directories into one folder and use that folder as the base path; everything else stays the same. Also, judging by the code segment that loaded the files, you already have file paths stored in the ImageName column, so the suggested approach should work for you.
images=[]
for index, row in MyIndex.iterrows():
    img_path=basePath+row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_path=None
    img_data = image.img_to_array(img)
    img=None
    images.append(img_data)
    img_data=None
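The relative-path idea from this reply could be sketched as follows. This is only an illustration: the example paths and the helper to_relative_names are hypothetical, and it assumes every image already lives somewhere under a single base folder.

```python
from pathlib import PurePosixPath

def to_relative_names(absolute_paths, base_path):
    """Turn absolute image paths into names relative to base_path.

    Raises ValueError if a path is not located under base_path.
    """
    base = PurePosixPath(base_path)
    return [str(PurePosixPath(p).relative_to(base)) for p in absolute_paths]

# Hypothetical layout: class sub-directories under one common folder.
paths = ["/data/all/cats/img1.jpg", "/data/all/dogs/img2.jpg"]
names = to_relative_names(paths, "/data/all")
print(names)  # ['cats/img1.jpg', 'dogs/img2.jpg']

# Joining the base folder and a relative name recovers the full path.
print(str(PurePosixPath("/data/all") / names[0]))  # /data/all/cats/img1.jpg
```

The resulting names are what would go into the ImageName column, with the common folder used as the base path.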
If anything is still unclear, feel free to ask again.
I think the simplest way to do this would be to load only part of your images for each generator and repeatedly call .fit_generator() with that smaller batch.
The previous version used random.random(), but we can just as well use a start index and page size, as in this revised version, to loop over the list of images forever.
import itertools

def load_images(start_index, page_size):
    images = []
    for offset in range(page_size):
        # Generate index using modulo to loop over the list forever
        index = (start_index + offset) % len(MyIndex)
        # Positional row lookup; MyIndex[index] would look up a column instead
        row = MyIndex.iloc[index]
        img_path = basePath + row["ImageName"]
        img = image.load_img(img_path, target_size=(299, 299))
        img_data = image.img_to_array(img)
        images.append(img_data)
    return images
def generate_datagen(batch_size, start_index, page_size):
    images = load_images(start_index, page_size)
    # ... everything else you need to get from images to trainX and trainY, etc. here ...
    datagen = ImageDataGenerator(
        rotation_range=90,
        horizontal_flip=True,
        vertical_flip=True,
        width_shift_range=0.25,
        height_shift_range=0.25,
    )
    datagen.fit(trainX, augment=True)
    return (
        trainX,
        trainY,
        valX,
        valY,
        datagen.flow(trainX, trainY, batch_size=batch_size),
    )
model.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)
# load 500 images at a time; change this as suitable for your memory condition
page_size = 500
batch_size = 128
for page in itertools.count():  # Count from zero to forever.
    trainX, trainY, valX, valY, generator = generate_datagen(
        batch_size, page * page_size, page_size
    )
    model.fit_generator(
        generator,
        epochs=5,
        steps_per_epoch=trainX.shape[0] // batch_size,
        validation_data=(valX, valY),
    )
    # TODO: add a `break` clause with a suitable condition
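To make the modulo paging above concrete, here is a small sketch of which row indices each page touches and how many pages one pass over the data takes. The helpers page_indices and pages_for_one_pass are hypothetical names; the page count is one possible basis for the `break` condition, not the only one.

```python
import math

def page_indices(start_index, page_size, n_rows):
    """Row indices for one page, wrapping around at the end of the table."""
    return [(start_index + offset) % n_rows for offset in range(page_size)]

def pages_for_one_pass(n_rows, page_size):
    """Number of pages needed to visit every row at least once."""
    return math.ceil(n_rows / page_size)

n_rows, page_size = 10, 4  # tiny numbers chosen for illustration
for page in range(pages_for_one_pass(n_rows, page_size)):
    print(page, page_indices(page * page_size, page_size, n_rows))
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6, 7]
# 2 [8, 9, 0, 1]
```

Note that the last page wraps around and revisits the first rows when page_size does not divide the number of rows evenly.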
If you want to load from disk, it is convenient to do so with the ImageDataGenerator that you used.
There are two ways to do it: by stating the directory of the data with flow_from_directory, or alternatively by using flow_from_dataframe with a Pandas dataframe.
If you want to work from a list of paths, you can instead use a custom generator that yields batches of images. Here is a stub:
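import numpy as np

def load_image_from_path(path):
    "Loading and preprocessing"
    ...

def my_generator():
    length = df.shape[0]
    while True:  # fit_generator expects the generator to loop indefinitely
        for i in range(0, length, batch_size):
            # iloc slicing is end-exclusive and clips at the end of the frame
            batch = df.iloc[i:i + batch_size]
            x = np.array([load_image_from_path(p) for p in batch['ImageName']])
            # labels still need encoding (e.g. one-hot) before training
            y = batch['ClassName'].values
            yield x, y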
Note: fit_generator has an additional argument named validation_data, which can take a second generator for, well, you guessed it, validation.
One option is to pass the generators the indices to choose from in order to split train and test (assuming the data is shuffled, if not check this out).
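A minimal sketch of that index-based split, assuming rows are addressed by position: the indices are shuffled once, divided into two disjoint lists, and each generator walks only its own list. The names split_indices and batches_forever are hypothetical, and the actual image loading is left out.

```python
import random

def split_indices(n_rows, val_fraction=0.1, seed=42):
    """Shuffle row positions once and split them into train and validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]

def batches_forever(indices, batch_size):
    """Yield batches of row positions, restarting after each full pass."""
    while True:
        for start in range(0, len(indices), batch_size):
            yield indices[start:start + batch_size]

train_idx, val_idx = split_indices(20, val_fraction=0.2)
gen = batches_forever(train_idx, batch_size=6)
first_three = [next(gen) for _ in range(3)]
print([len(b) for b in first_three])  # [6, 6, 4]
```

Each batch of positions could then be turned into images and labels inside the generator, for example with df.iloc[batch].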