TensorFlow Convolutional Neural Net - training with a small dataset, applying random changes to images

Say I have a very small dataset, just 50 images. I want to reuse the code from the tutorial at Red Pill, but apply random transformations to the same set of images in each training batch, such as random changes to brightness, contrast, etc. I added just one function:

def preprocessImages(x):
    retValue = numpy.empty_like(x)
    for i in range(50):
        image = x[i]
        image = tf.reshape(image, [28,28,1])
        image = tf.image.random_brightness(image, max_delta=63)
        #image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
        # Subtract off the mean and divide by the standard deviation of the pixels.
        float_image = tf.image.per_image_whitening(image)
        float_image_Mat = sess.run(float_image)
        retValue[i] = float_image_Mat.reshape((28*28))
    return retValue

I also made a small change to the original training loop, so that the same batch is reused on every step:

batch = mnist.train.next_batch(50)
for i in range(1000):
  #batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={
        x:preprocessImages(batch[0]), y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})

The first iteration succeeds; after that it crashes:

step 0, training accuracy 0.02
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_13_grad/Relu_13/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_17)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_14_grad/Relu_14/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_18)]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 69, in <module>
    train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1267, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2763, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 345, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 419, in _do_run
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
Caused by op u'gradients_4/Relu_12_grad/Relu_12/CheckNumerics', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 58, in <module>
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 165, in minimize
    gate_gradients=gate_gradients)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 205, in compute_gradients
    loss, var_list, gate_gradients=(gate_gradients == Optimizer.GATE_OP))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 414, in gradients
    in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 107, in _ReluGrad
    t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 100, in _VerifyTensor
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'Relu_12', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 34, in <module>
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 506, in relu
    return _op_def_lib.apply_op("Relu", features=features, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

This is exactly the same error that I get with my personal dataset with 50 training examples.

user2849678 asked Dec 06 '15 10:12



1 Answer

One thing to start with: instead of computing y_conv and then the cross-entropy, use the merged tf.nn.softmax_cross_entropy_with_logits operator. This may not solve your problem, but it is more numerically stable than the naive version in the Red Pill example.
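
To illustrate why the merged form tends to be safer, here is a minimal NumPy sketch. It is not TensorFlow's implementation, and the function names are made up for the illustration. It compares "softmax, then log, then multiply by the labels" against computing the log-softmax directly with the log-sum-exp trick, on deliberately extreme logits:

```python
import numpy as np

def naive_cross_entropy(logits, labels):
    # Hypothetical helper: softmax first, then log, as in tutorial-style code.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Silence the warnings from log(0) and 0 * -inf so the NaN is visible.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.sum(labels * np.log(probs), axis=1)

def stable_cross_entropy(logits, labels):
    # Hypothetical helper: log-softmax via log-sum-exp, never taking log(0).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.sum(labels * log_probs, axis=1)

# Extreme logits: two of the softmax probabilities underflow to exactly 0.
logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([[1.0, 0.0, 0.0]])

naive = naive_cross_entropy(logits, labels)
stable = stable_cross_entropy(logits, labels)
print("naive:", naive, "stable:", stable)
```

In the naive version the zero-probability classes contribute 0 * log(0), which evaluates to NaN even though the prediction is correct; the log-sum-exp version stays finite for the same input.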

Second, try printing out cross_entropy at every iteration:

cross_entropy = ...  # (previous code here)
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "Cross-entropy: ")

This shows whether the loss climbs toward infinity as training progresses or jumps straight to inf or NaN. If it blows up gradually, the learning rate is probably too high. If it jumps, it could be a numerical boundary condition, which the merged operator above would address. If it is bad from the very first step, there may be an error in how you apply the distortions that ends up feeding badly broken data into the network.
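
To rule out the last case, it can help to inspect the preprocessed batch before it is fed to the network. The following is a small NumPy sketch under stated assumptions: describe_batch is a hypothetical helper, and the random arrays only stand in for the output of the question's preprocessImages.

```python
import numpy as np

def describe_batch(batch):
    """Summarize a batch so that NaN/inf values or odd ranges stand out."""
    batch = np.asarray(batch, dtype=np.float64)
    finite = np.isfinite(batch)
    has_finite = bool(finite.any())
    return {
        "all_finite": bool(finite.all()),
        "num_non_finite": int((~finite).sum()),
        # Range over the finite entries only, so one NaN does not hide it.
        "min": float(batch[finite].min()) if has_finite else None,
        "max": float(batch[finite].max()) if has_finite else None,
    }

# Stand-in data (assumption): 50 flattened 28x28 images with values in [0, 1).
rng = np.random.default_rng(0)
good = rng.random((50, 28 * 28))

# The same batch with a single corrupted pixel.
bad = good.copy()
bad[3, 10] = np.nan

print(describe_batch(good))
print(describe_batch(bad))
```

Calling such a check on the array returned by the preprocessing function, before train_step.run, would show whether the distortions themselves already produce non-finite or wildly out-of-range values.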

dga answered Oct 01 '22 22:10
