While running a Kubeflow pipeline whose code uses TensorFlow 2.0, the error below is displayed at the end of each epoch:
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Also, after some epochs, it stops printing logs and the step fails with this error:
This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.
Upgrading tensorflow from 2.1 to 2.2 fixed this issue for me. I didn't have to go to the tf-nightly version.
In my case, the batch_size and steps_per_epoch did not match. For example,
his = Test_model.fit_generator(datagen.flow(trainrancrop_images, trainrancrop_labels, batch_size=batchsize),
                               steps_per_epoch=len(trainrancrop_images)/batchsize,
                               validation_data=(test_images, test_labels),
                               epochs=1,
                               callbacks=[callback])
The batch_size in datagen.flow must correspond to the steps_per_epoch in Test_model.fit_generator (I had actually used the wrong value for steps_per_epoch).
This is one possible cause of the error, I guess.
So I think the problem arises when the batch size and the number of steps (iterations) per epoch do not correspond.
A float may also be a problem when you get the step count by dividing: in Python 3, len(trainrancrop_images)/batchsize yields a float, not an integer.
Check your code for this issue.
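To make the step count an integer that matches the batch size, something like the sketch below might help. The helper name steps_for and the sample sizes are made up for illustration, and it assumes the generator yields one final partial batch when the data does not divide evenly (which, as far as I know, is how datagen.flow behaves).

```python
def steps_for(num_samples: int, batch_size: int) -> int:
    """Batches needed to cover every sample once (hypothetical helper).

    Uses integer ceiling division, so the result is always an int and a
    trailing partial batch is counted as one extra step.
    """
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    return (num_samples + batch_size - 1) // batch_size


# Illustrative sizes only, not taken from the original post.
num_samples = 1000
batch_size = 32

print(num_samples / batch_size)            # plain division gives a float: 31.25
print(steps_for(num_samples, batch_size))  # integer step count: 32
```

The result could then be passed as steps_per_epoch=steps_for(len(trainrancrop_images), batchsize) in the call above.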
Good luck :)