The MNIST For ML Beginners tutorial is giving me an error when I run print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})). Everything else runs fine.
Error and trace:
InternalErrorTraceback (most recent call last)
<ipython-input-16-219711f7d235> in <module>()
----> 1 print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
338 try:
339 result = self._run(None, fetches, feed_dict, options_ptr,
--> 340 run_metadata_ptr)
341 if run_metadata:
342 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
562 try:
563 results = self._do_run(handle, target_list, unique_fetches,
--> 564 feed_dict_string, options, run_metadata)
565 finally:
566 # The movers are no longer used. Delete them.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
635 if handle is None:
636 return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 637 target_list, options, run_metadata)
638 else:
639 return self._do_call(_prun_fn, self._session, handle, feed_dict,
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
657 # pylint: disable=protected-access
658 raise errors._make_specific_exception(node_def, op, error_message,
--> 659 e.code)
660 # pylint: enable=protected-access
661
InternalError: Dst tensor is not initialized.
[[Node: _recv_Placeholder_3_0/_1007 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_312__recv_Placeholder_3_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: Mean_1/_1011 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_319_Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I just switched to a more recent version of CUDA, so maybe this has something to do with that? Seems like this error is about copying a tensor to the GPU.
Stack: EC2 g2.8xlarge machine, Ubuntu 14.04
UPDATE:
print(sess.run(accuracy, feed_dict={x: batch_xs, y_: batch_ys})) runs fine. This leads me to suspect that the issue is that I'm trying to transfer a huge tensor to the GPU and it can't take it. Small tensors like a minibatch work just fine.
UPDATE 2:
I've figured out exactly how big the tensors have to be to cause this issue:
batch_size = 7509 #Works.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))
batch_size = 7510 #Doesn't work. Gets the Dst error.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))
In short, this error message is generated when there is not enough GPU memory to handle the batch size.
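As a workaround, here is a minimal sketch (reusing the x, y_, accuracy, sess, and mnist variables from the tutorial) that evaluates the test set in smaller chunks and combines the per-chunk results:
# Evaluate accuracy in chunks small enough to fit in GPU memory,
# then combine the per-chunk results into an overall accuracy.
chunk = 1000  # anything comfortably below the ~7510-image limit observed above
correct = 0.0
for start in range(0, len(mnist.test.images), chunk):
    xs = mnist.test.images[start:start + chunk]
    ys = mnist.test.labels[start:start + chunk]
    # per-chunk accuracy times chunk size = number of correct predictions
    correct += sess.run(accuracy, feed_dict={x: xs, y_: ys}) * len(xs)
print(correct / len(mnist.test.images))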
Expanding on Steven's link (I cannot post comments yet), here are a few tricks to monitor and control memory usage in TensorFlow:
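For example, a minimal sketch (assuming the Session-based API shown in the traceback above; the 0.4 fraction is an arbitrary placeholder):
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of grabbing
# nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Alternatively, hard-cap the fraction of GPU memory this process may use.
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
You can watch actual usage from the shell with nvidia-smi while the session runs.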
I think this link can help: https://github.com/aymericdamien/TensorFlow-Examples/issues/38#issuecomment-223793214. In my case the GPU was busy (93% utilization) training another model in a screen session. I needed to kill that process, and was happy to see things working afterwards.
Keep in mind that the EC2 g2.8xlarge only has 4 GB of memory per GPU: https://aws.amazon.com/ec2/instance-types/
I don't have a good way to find out how much memory the model itself takes up, other than running it with a batch size of 1 and then subtracting out how much memory one image takes up. From there you can determine your maximum batch size. This should work, but keep in mind that TensorFlow allocates GPU memory dynamically (similar to Torch, and unlike Caffe, which blocks off the maximum GPU space it requires from the start), so you probably want to be conservative with the maximum batch size.
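As a rough sketch of that back-of-the-envelope calculation (all numbers below are placeholders you would measure yourself, e.g. with nvidia-smi):
# Hypothetical measurements; substitute what you observe on your GPU.
total_gpu_mem_mb = 4096.0      # g2.8xlarge: about 4 GB per GPU
model_only_mb = 500.0          # usage with the graph built but nothing fed
with_one_image_mb = 500.5      # usage when feeding a single test image

per_image_mb = with_one_image_mb - model_only_mb
max_batch = int((total_gpu_mem_mb - model_only_mb) / per_image_mb)

# Be conservative, since TensorFlow allocates GPU memory dynamically.
safe_batch = int(0.8 * max_batch)
print(safe_batch)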