I am attempting to predict features in imagery using Keras with a TensorFlow backend. Specifically, I am trying to use a Keras ImageDataGenerator. The model is set to run for 4 epochs and runs fine until the 4th epoch, where it fails with a MemoryError.
I am running this model on an AWS g2.2xlarge instance running Ubuntu Server 16.04 LTS (HVM), SSD Volume Type.
The training images are 256x256 RGB pixel tiles (8 bit unsigned) and the training mask is 256x256 single band (8 bit unsigned) tiled data where 255 == a feature of interest and 0 == everything else.
The following 3 functions are the ones pertinent to this error.
How can I resolve this MemoryError?
def train_model():
    batch_size = 1
    training_imgs = np.lib.format.open_memmap(filename=os.path.join(data_path, 'data.npy'), mode='r+')
    training_masks = np.lib.format.open_memmap(filename=os.path.join(data_path, 'mask.npy'), mode='r+')

    dl_model = create_model()
    print(dl_model.summary())
    model_checkpoint = ModelCheckpoint(os.path.join(data_path, 'mod_weight.hdf5'), monitor='loss', verbose=1, save_best_only=True)
    dl_model.fit_generator(generator(training_imgs, training_masks, batch_size), steps_per_epoch=(len(training_imgs)/batch_size), epochs=4, verbose=1, callbacks=[model_checkpoint])
def generator(train_imgs, train_masks=None, batch_size=None):
    # Create empty arrays to contain the batch of features and labels
    if train_masks is not None:
        train_imgs_batch = np.zeros((batch_size, y_to_res, x_to_res, bands))
        train_masks_batch = np.zeros((batch_size, y_to_res, x_to_res, 1))
        while True:
            for i in range(batch_size):
                # choose a random index in the features
                index = random.choice(range(len(train_imgs)))
                train_imgs_batch[i] = train_imgs[index]
                train_masks_batch[i] = train_masks[index]
            yield train_imgs_batch, train_masks_batch
    else:
        rec_imgs_batch = np.zeros((batch_size, y_to_res, x_to_res, bands))
        while True:
            for i in range(batch_size):
                # choose a random index in the features
                index = random.choice(range(len(train_imgs)))
                rec_imgs_batch[i] = train_imgs[index]
            yield rec_imgs_batch
def train_generator(train_images, train_masks, batch_size):
    data_gen_args = dict(rotation_range=90., horizontal_flip=True, vertical_flip=True, rescale=1./255)
    image_datagen = ImageDataGenerator()
    mask_datagen = ImageDataGenerator()

    # Provide the same seed and keyword arguments to the fit and flow methods
    seed = 1
    image_datagen.fit(train_images, augment=True, seed=seed)
    mask_datagen.fit(train_masks, augment=True, seed=seed)

    image_generator = image_datagen.flow(train_images, batch_size=batch_size)
    mask_generator = mask_datagen.flow(train_masks, batch_size=batch_size)

    return zip(image_generator, mask_generator)
The following is the output from the model detailing the epochs and the error message:
Epoch 00001: loss improved from inf to 0.01683, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 2/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0027 - jaccard_coef_int: 0.9983
Epoch 00002: loss improved from 0.01683 to 0.00492, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 3/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0026 - jaccard_coef_int: 0.9982
Epoch 00003: loss improved from 0.00492 to 0.00488, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 4/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0074 - binary_crossentropy: 0.0042 - jaccard_coef_int: 0.9975
Epoch 00004: loss did not improve
Traceback (most recent call last):
File "image_rec.py", line 291, in <module>
train_model()
File "image_rec.py", line 208, in train_model
dl_model.fit_generator(train_generator(training_imgs,training_masks,batch_size),steps_per_epoch=1,epochs=1,workers=1)
File "image_rec.py", line 274, in train_generator
image_datagen.fit(train_images, augment=True, seed=seed)
File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/keras/preprocessing/image.py", line 753, in fit
x = np.copy(x)
File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1505, in copy
return array(a, order=order, copy=True)
MemoryError
You provided quite confusing code (in my opinion), i.e. no call to train_generator is visible. I am not sure that this is a problem of insufficient memory due to big data, since you use memmap for that, but let's assume for now it is.
In that case you could use ImageDataGenerator's flow_from_directory method. It would require a slight change of design, though, which might not be what you want. You can load it in the following manner:
train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(256, 256),
        batch_size=batch_size,
        ...)  # other configurations
More on that in the Keras documentation.
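For an image/mask setup like yours, one possible arrangement (a sketch only; the directory names and the one-subfolder-per-directory layout are assumptions, since flow_from_directory expects class subfolders) is to build two directory generators with the same seed and zip them:
image_datagen = ImageDataGenerator(rescale=1./255)
mask_datagen = ImageDataGenerator(rescale=1./255)

seed = 1
# 'data/train_images' and 'data/train_masks' are hypothetical directories,
# each containing a single subdirectory holding the 256x256 tiles.
image_generator = image_datagen.flow_from_directory(
        'data/train_images', target_size=(256, 256), class_mode=None,
        batch_size=batch_size, seed=seed)
mask_generator = mask_datagen.flow_from_directory(
        'data/train_masks', target_size=(256, 256), color_mode='grayscale',
        class_mode=None, batch_size=batch_size, seed=seed)

# class_mode=None makes each generator yield only the images themselves,
# so zipping them gives (image_batch, mask_batch) pairs for fit_generator.
train_generator = zip(image_generator, mask_generator)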
Also note that on a 32-bit Python, memmap does not allow more than 2 GB.
Do you use tensorflow-gpu, by any chance? Maybe your GPU is not sufficient; you could try this with the tensorflow package.
I would strongly suggest trying some memory profiling to see where the bigger memory allocations happen.
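As a rough sketch (one option among several; memory_profiler is a third-party package, and the decorated function is the train_model() from your post):
# pip install memory_profiler
from memory_profiler import profile

@profile          # prints a line-by-line memory report for this function
def train_model():
    ...           # body exactly as in the question

# Run the script through the profiler to see per-line memory usage:
#   python -m memory_profiler image_rec.py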
If it is not a case of insufficient memory, it might be wrong handling of the data in your model; since your loss function is not improving at all, it could be miswired, for example.
Finally, one last note here: it is good practice to open the memmap of the training data as read-only, since you don't want to accidentally mess up the data.
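For example (reusing the paths from your train_model, just with mode='r'):
training_imgs = np.lib.format.open_memmap(filename=os.path.join(data_path, 'data.npy'), mode='r')
training_masks = np.lib.format.open_memmap(filename=os.path.join(data_path, 'mask.npy'), mode='r')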
UPDATE: I can see that you've updated the post and provided the code for the train_generator method, but there is still no call to that method in your code.
If I assume that you have a typo in the call - train_generator instead of the generator method in your dl_model.fit_generator call - it is possible that the fit_generator method is not working on a batch of data, but actually on the whole training_imgs, and it copies over the whole set in the np.copy(x) call.
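If that is the case, one way around the copy (a sketch, assuming you need ImageDataGenerator.fit at all, i.e. for featurewise statistics) is to fit on a small random subset of the memmapped array instead of the whole thing:
# fit() is only needed for featurewise_center / featurewise_std_normalization /
# zca_whitening; if you don't use those, you can skip it entirely.
# size=100 is an arbitrary sample size.
sample_idx = np.sort(np.random.choice(len(train_images), size=100, replace=False))
image_datagen.fit(np.asarray(train_images[sample_idx]), augment=True, seed=seed)
mask_datagen.fit(np.asarray(train_masks[sample_idx]), augment=True, seed=seed)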
Also, as mentioned already, there indeed are a few issues with Keras memory leaks when using the fit and fit_generator methods (you can find some of them; e.g. here is an open one).
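A commonly suggested (not guaranteed) mitigation in that situation is to release the backend graph and force garbage collection between training runs, for example:
import gc
from keras import backend as K

# Drop the current TensorFlow graph/session and collect leftover Python objects
K.clear_session()
gc.collect()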
It seems your problem is that the data is too huge. I can see two solutions. The first one is to run your code in a distributed system by means of Spark; I guess you do not have this support, so let us move on to the other.
The second one, which I think is viable, is to slice the data and feed it to the model incrementally. We can do this with Dask. This library can slice the data and save it in objects which you can then retrieve from disk, reading only the part you want.
If you have an image whose size is a 100x100 matrix, we can retrieve each array without the need to load all 100 arrays in memory. We can load array by array in memory (releasing the previous one), which would be the input to your neural network.
To do this, you can transform your np.array into a Dask array and assign the partitions. For example:
>>> k = np.random.randn(10,10) # Matrix 10x10
>>> import dask.array as da
>>> k2 = da.from_array(k, chunks=3)
>>> k2
dask.array<array, shape=(10, 10), dtype=float64, chunksize=(3, 3)>
>>> k2.to_delayed()
array([[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 0)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 1)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 2)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 3))],
[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 0)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 1)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 2)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 3))],
[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 0)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 1)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 2)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 3))],
[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 0)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 1)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 2)),
Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 3))]],
dtype=object)
Here, you can see how the data is saved in objects, which you can then retrieve in parts to feed your model.
To implement this solution you must introduce a loop in your function that iterates over each partition and feeds it to the NN to get the incremental training, as sketched below.
For more information, see the Dask documentation.
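A minimal sketch of such a loop (assuming dl_model is your compiled Keras model, that the memmapped arrays have shapes (N, 256, 256, 3) and (N, 256, 256, 1), and that the chunk size of 32 samples is arbitrary):
import dask.array as da

# Chunk only along the first (sample) axis so every tile stays intact
imgs_da = da.from_array(training_imgs, chunks=(32, 256, 256, 3))
masks_da = da.from_array(training_masks, chunks=(32, 256, 256, 1))

for img_block, mask_block in zip(imgs_da.to_delayed().ravel(),
                                 masks_da.to_delayed().ravel()):
    x = img_block.compute()    # only this chunk is loaded into memory
    y = mask_block.compute()
    dl_model.train_on_batch(x / 255.0, y / 255.0)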
This is common when running 32-bit Python if the float precision is too high. Are you running 32-bit? You may also consider casting or rounding the array.
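For example (a sketch against the generator from the question, where the batch buffers default to float64; forcing float32 halves their footprint):
# dtype=np.float32 instead of the default float64 halves the batch memory;
# y_to_res, x_to_res and bands are the same globals used in the question.
train_imgs_batch = np.zeros((batch_size, y_to_res, x_to_res, bands), dtype=np.float32)
train_masks_batch = np.zeros((batch_size, y_to_res, x_to_res, 1), dtype=np.float32)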