
Distributed Tensorflow: check failed: size>=0

I'm using Keras 2.0.6 with TensorFlow 1.3.0.

My code runs with the Theano backend but fails with the TensorFlow backend:

F tensorflow/core/framework/tensor_shape.cc:241] Check failed: size >= 0 (-14428307456 vs. 0)

I was wondering if anyone can think of a possible reason for this.

Thank you!

----UPDATE-----

I tested exactly the same code on my PC with TensorFlow, and it runs perfectly.

However, it throws this error when I run it on a supercomputer.

The error looks like an overflow, but I don't see how the same code could overflow on the supercomputer and not on my PC.

I suspect that it comes from a bug in TensorFlow's distributed computation.

volcanofly asked Jul 31 '17 18:07


2 Answers

I came across the same bug, but TensorFlow ran fine after I reduced the batch size.

I think the reason is that the GPU was running out of memory.
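
To get a rough feel for how batch size drives tensor size, here is a minimal sketch that estimates the memory a dense tensor would need. The helper `tensor_bytes`, the dtype table, and the example shape are all hypothetical and not taken from the question's code. It only does the arithmetic and does not reproduce TensorFlow's internal shape check.

```python
from functools import reduce
from operator import mul

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float16": 2, "float32": 4, "float64": 8, "int32": 4, "int64": 8}


def tensor_bytes(shape, dtype="float32"):
    """Return the size in bytes of a dense tensor with the given shape."""
    if any(dim < 0 for dim in shape):
        raise ValueError(f"negative dimension in shape {shape}")
    num_elements = reduce(mul, shape, 1)  # an empty shape is a scalar: 1 element
    return num_elements * DTYPE_BYTES[dtype]


# Hypothetical activation shape: (batch, height, width, channels).
for batch_size in (256, 64, 16):
    size_gib = tensor_bytes((batch_size, 224, 224, 64)) / 2**30
    print(f"batch_size={batch_size:4d} -> {size_gib:.2f} GiB")
```

The estimate scales linearly with the batch size, so halving the batch halves the size of every batch-shaped tensor. Real memory use is higher, since a model holds many such tensors plus weights and gradients.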

bai wenjie answered Oct 03 '22 02:10


I ran into the same error. In my case, it came from a TensorFlow version mismatch.

The model was trained in TF 1.15 but frozen in TF 1.13. When I froze it in TF 1.15, everything worked.

I suggest checking which TensorFlow version your model was created with.
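
One way to compare environments is to print the installed package versions on each machine. This is a minimal sketch using only the standard library (Python 3.8+). The helper name `installed_version` is hypothetical, and the distribution names are an assumption: your install may be registered under a different name such as `tensorflow-gpu`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match your environment.
for name in ("tensorflow", "keras"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this on both the machine that trained the model and the machine that freezes or loads it should show whether the versions differ.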

colten answered Oct 03 '22 03:10