
TensorFlow or Theano: how do they know the loss function derivative based on the neural network graph?

In TensorFlow or Theano, you only tell the library what your neural network looks like and how the feed-forward pass should operate.

For instance, in TensorFlow, you would write:

import tensorflow as tf

# X and y are assumed to be float32 NumPy arrays of inputs and targets.
graph = tf.Graph()
with graph.as_default():
    _X = tf.constant(X)
    _y = tf.constant(y)

    # One hidden layer with 20 units.
    hidden = 20
    w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
    b0 = tf.Variable(tf.truncated_normal([hidden]))

    h = tf.nn.softmax(tf.matmul(_X, w0) + b0)

    # Output layer with a single unit.
    w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
    b1 = tf.Variable(tf.truncated_normal([1]))

    yp = tf.nn.softmax(tf.matmul(h, w1) + b1)

    # Quadratic (L2) loss, minimized with plain gradient descent.
    loss = tf.reduce_mean(0.5 * tf.square(yp - _y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

I am using the L2-norm loss function, C = 0.5*sum((y - yp)^2), and in the backpropagation step the derivative presumably has to be computed, dC/dyp = yp - y. See (30) in this book.

My question is: how does TensorFlow (or Theano) know the analytical derivative needed for backpropagation? Or do they use an approximation? Or do they somehow avoid using the derivative altogether?

I have done the Udacity deep learning course on TensorFlow, but I am still struggling to make sense of how these libraries work.

asked Feb 11 '16 by Ricardo Magalhães Cruz

1 Answer

The differentiation happens in the final line:

    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

When you execute the minimize() method, TensorFlow identifies the set of variables on which loss depends, and computes gradients for each of these. The differentiation is implemented in ops/gradients.py, and it uses "reverse accumulation". Essentially it searches backwards from the loss tensor to the variables, applying the chain rule at each operator in the dataflow graph. TensorFlow includes "gradient functions" for most (differentiable) operators, and you can see an example of how these are implemented in ops/math_grad.py. A gradient function can use the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs to produce gradients for each of its inputs.
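
For example, you can invoke the same machinery directly with tf.gradients(). A minimal sketch, using the TF 1.x-style API from the question and made-up names (x, target, loss):

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        x = tf.Variable(3.0)
        target = tf.constant(1.0)
        loss = 0.5 * tf.square(x - target)  # C = 0.5 * (x - target)^2

        # tf.gradients() searches backwards from `loss` to `x`, applying the
        # gradient function registered for each op (e.g. in ops/math_grad.py).
        grad, = tf.gradients(loss, [x])     # symbolically, dC/dx = x - target

        init = tf.global_variables_initializer()

    with tf.Session(graph=graph) as sess:
        sess.run(init)
        print(sess.run(grad))               # prints 2.0, i.e. 3.0 - 1.0

Under the hood, minimize() does essentially this for each trainable variable and then adds the corresponding update ops.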

Page 7 of Ilya Sutskever's PhD thesis has a nice explanation of how this process works in general.
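
To make "reverse accumulation" concrete, here is a tiny framework-free sketch (illustrative only, with made-up scalar values) of the chain rule being applied backwards through the two-op graph C = 0.5 * (w*x - y)^2:

    # Made-up scalar values for the input, target and weight.
    x, y, w = 2.0, 1.0, 1.5

    # Forward pass: compute and keep every intermediate value.
    a = w * x                  # op 1
    C = 0.5 * (a - y) ** 2     # op 2: quadratic loss

    # Reverse accumulation: start from dC/dC = 1 and multiply by each
    # op's local derivative while walking back towards the variable.
    dC_dC = 1.0
    dC_da = dC_dC * (a - y)    # local derivative of op 2 w.r.t. a
    dC_dw = dC_da * x          # local derivative of op 1 w.r.t. w

    print(dC_dw)               # 4.0, which equals the analytic (w*x - y) * x

TensorFlow does the same thing, except that each backward step produces new nodes in the dataflow graph rather than numbers.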

answered by mrry