
tf.reduce_sum on GPU fails in combination with placeholder as input shape


UPDATE: Fixed in TensorFlow 1.14.0 (maybe earlier, I didn't check).

UPDATE: Still occurring in TensorFlow 1.7.0.

UPDATE: I wrote a Colab notebook that reproduces this bug on Google's GPU hardware: https://drive.google.com/file/d/13V87kSTyyFVMM7NoJNk9QTsCYS7FRbyz/view?usp=sharing

UPDATE: After wrongly accusing tf.gather in the first revisions of this question, I have now narrowed it down to tf.reduce_sum in combination with a placeholder as the shape:

tf.reduce_sum produces zeros (on GPU only) for large tensors whose shape depends on a placeholder.

To reproduce, run the following code while feeding a large integer (more than 700000 in my case) to the placeholder batch_size:

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)
import numpy as np

graph = tf.Graph()
with graph.as_default():
    # Scalar placeholder that sets the first dimension at run time.
    batch_size = tf.placeholder(tf.int32, shape=[])
    ones_with_placeholder = tf.ones([batch_size, 256, 4])
    # Summing four ones along axis 2 should give 4.0 for every entry.
    sum_out = tf.reduce_sum(ones_with_placeholder, axis=2)
    min_sum_out = tf.reduce_min(sum_out)

sess = tf.Session(graph=graph)

sum_result, min_sum_result = sess.run([sum_out, min_sum_out], feed_dict={batch_size: 1000000})
print("Min value in sum_out processed on host with numpy:", np.min(sum_result))
print("Min value in sum_out tensor processed in graph with tf:", min_sum_result)

It prints the following incorrect result:

Min value in sum_out processed on host with numpy: 0.0
Min value in sum_out tensor processed in graph with tf: 0.0

I was expecting that applying reduce_sum over axis 2 should result in 4.0 everywhere!

Running this exact code on CPU gives correct results. Using a fixed shape for tf.ones also gives correct results on both CPU and GPU:

ones_with_fixed_shape = tf.ones([1000000,256,4])
sum_out = tf.reduce_sum(ones_with_fixed_shape,axis=2)
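
For reference, the expected values can be checked on the host with NumPy alone. This is only a minimal sketch: the helper name expected_sum and the small batch size of 1000 are illustrative choices, not part of the original code, and it does not touch the GPU code path at all.

```python
import numpy as np


def expected_sum(batch_size):
    """Host-side reference: sum a tensor of ones over its last axis."""
    # Same layout as in the question: [batch_size, 256, 4], float32 ones.
    ones = np.ones((batch_size, 256, 4), dtype=np.float32)
    # Each slice along axis 2 holds four ones, so every sum should be 4.0.
    return ones.sum(axis=2)


# A small batch keeps memory use low; the shape logic is the same.
result = expected_sum(1000)
print("Shape of the reduced tensor:", result.shape)
print("Min and max of the reduced tensor:", result.min(), result.max())
```

Any correct reduce_sum over axis 2 should agree with this reference, independent of how the shape was specified.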

What is the problem with the placeholder on GPU?

asked Mar 27 '18 at 18:03 by sdnr


1 Answer

The basic problem is a speed/accuracy tradeoff. Even though your example seems trivial, with the entire tensor initialized to 1, it has 1,000,000 × 256 × 4 = 1.024 billion entries. Note that int32 can represent integers in the range [-2,147,483,648, 2,147,483,647] without loss of precision.

So we expect to see some error if we accumulate all of the entries and perform the computation. This also explains why smaller tensors (smaller batch sizes) did not exhibit the problem.
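
The arithmetic behind the 1.024 billion figure can be sketched in plain Python. The helper count_elements is a hypothetical name introduced here for illustration. The snippet only reproduces the element counts for the batch size used in the question (1,000,000) and the reported threshold (700,000), and compares them with the int32 range quoted above. It makes no claim about what TensorFlow does internally.

```python
def count_elements(shape):
    """Return the total number of entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


INT32_MAX = 2**31 - 1  # 2,147,483,647, upper end of the int32 range

# Batch size fed in the question.
num_elements = count_elements((1_000_000, 256, 4))
# Approximate batch size above which the asker saw zeros.
elements_at_threshold = count_elements((700_000, 256, 4))

print("Entries for batch_size=1000000:", num_elements)
print("Entries for batch_size=700000: ", elements_at_threshold)
print("int32 maximum:                 ", INT32_MAX)
print("Fits in int32:", num_elements <= INT32_MAX)
```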

answered Oct 14 '22 at 00:10 by Prakhar Agarwal
