
TensorFlow: Dst tensor is not initialized

Tags:

tensorflow

The MNIST For ML Beginners tutorial is giving me an error when I run print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})). Everything else runs fine.

Error and trace:

InternalErrorTraceback (most recent call last)
<ipython-input-16-219711f7d235> in <module>()
----> 1 print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    338     try:
    339       result = self._run(None, fetches, feed_dict, options_ptr,
--> 340                          run_metadata_ptr)
    341       if run_metadata:
    342         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
    562     try:
    563       results = self._do_run(handle, target_list, unique_fetches,
--> 564                              feed_dict_string, options, run_metadata)
    565     finally:
    566       # The movers are no longer used. Delete them.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
    635     if handle is None:
    636       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 637                            target_list, options, run_metadata)
    638     else:
    639       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
    657       # pylint: disable=protected-access
    658       raise errors._make_specific_exception(node_def, op, error_message,
--> 659                                             e.code)
    660       # pylint: enable=protected-access
    661 

InternalError: Dst tensor is not initialized.
     [[Node: _recv_Placeholder_3_0/_1007 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_312__recv_Placeholder_3_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
     [[Node: Mean_1/_1011 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_319_Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

I just switched to a more recent version of CUDA, so maybe this has something to do with that? Seems like this error is about copying a tensor to the GPU.

Stack: EC2 g2.8xlarge machine, Ubuntu 14.04

UPDATE:

print(sess.run(accuracy, feed_dict={x: batch_xs, y_: batch_ys})) runs fine. This leads me to suspect that the issue is that I'm trying to transfer a huge tensor to the GPU and it can't take it. Small tensors like a minibatch work just fine.

UPDATE 2:

I've figured out exactly how big the tensors have to be to cause this issue:

batch_size = 7509 #Works.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))

batch_size = 7510 #Doesn't work. Gets the Dst error.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))
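A common workaround for exactly this situation (an addition here, not from the original post) is to feed the test set in chunks and combine the per-chunk accuracies, weighted by chunk size. The batching logic can be sketched independently of TensorFlow; `accuracy_fn` stands in for the `sess.run(accuracy, ...)` call:

```python
import numpy as np

def chunked_accuracy(images, labels, accuracy_fn, batch_size=1000):
    """Evaluate accuracy in chunks of `batch_size` and return the
    size-weighted average, so no single feed has to fit on the GPU at once."""
    n = len(images)
    total_correct = 0.0
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        acc = accuracy_fn(images[start:end], labels[start:end])
        total_correct += acc * (end - start)  # weight by chunk size
    return total_correct / n
```

With the tutorial code, `accuracy_fn` would be something like `lambda xs, ys: sess.run(accuracy, feed_dict={x: xs, y_: ys})`, and any `batch_size` below the 7510 threshold found above would do.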
asked May 19 '16 by rafaelcosman

3 Answers

In short, this error message is generated when there is not enough GPU memory to handle the batch size.

Expanding on Steven's link (I cannot post comments yet), here are a few tricks to monitor and control memory usage in TensorFlow:

  • To monitor memory usage during runs, consider logging run metadata. You can then see the memory usage per node of your graph in TensorBoard. See the TensorBoard information page for more details and an example of this.
  • By default, TensorFlow tries to allocate as much GPU memory as possible. You can change this via the GPU options on the session config, so that TensorFlow only allocates as much memory as needed. There you will also find an option to allocate only a certain fraction of your GPU memory (although I have found this to be broken sometimes).
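In the TensorFlow 1.x API, the session-config options mentioned above look roughly like this (a sketch of a config fragment; the 0.4 fraction is an arbitrary example):

```python
import tensorflow as tf

config = tf.ConfigProto()

# Allocate GPU memory on demand instead of grabbing it all up front.
config.gpu_options.allow_growth = True

# Alternatively: cap TensorFlow at a fixed fraction of total GPU memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
```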
answered Oct 22 '22 by Lars Mennen

I think this link can help: https://github.com/aymericdamien/TensorFlow-Examples/issues/38#issuecomment-223793214. In my case, the GPU was busy (93% utilized) training another model in a screen session. I needed to kill that process, and was happy afterwards to see things working.

answered Oct 22 '22 by Julien Nyambal

Keep in mind that the EC2 g2.8xlarge only has 4 GB of GPU memory per GPU.
https://aws.amazon.com/ec2/instance-types/

I don't have a good way to find out how much space the model itself takes up other than running it with a batch size of 1; from there you can subtract to work out how much space one image takes up.

From there you can determine your maximum batch size. This should work, but I believe TensorFlow allocates GPU memory dynamically, similar to Torch and unlike Caffe, which blocks off the maximum GPU space it requires from the start. So you would probably want to be conservative with the maximum batch size.
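As a back-of-envelope version of that procedure (all the memory numbers below are illustrative assumptions, not measurements; substitute values observed on your own setup):

```python
# Hypothetical measurements, taken e.g. from nvidia-smi while the model runs.
gpu_mem_mb    = 4096   # g2.8xlarge: 4 GB per GPU
mem_batch1_mb = 900.0  # observed usage with batch size 1
mem_batch2_mb = 900.4  # observed usage with batch size 2

per_image_mb = mem_batch2_mb - mem_batch1_mb  # marginal cost of one image
model_mb = mem_batch1_mb - per_image_mb       # fixed model overhead

usable_mb = (gpu_mem_mb - model_mb) * 0.8  # leave a 20% safety margin
max_batch = int(usable_mb / per_image_mb)
print(max_batch)
```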

answered Oct 22 '22 by Steven