How does backpropagation work in TensorFlow?

Tags:

tensorflow

In TensorFlow it seems that the entire backpropagation algorithm is performed by a single run of an optimizer on a certain cost function, which is the output of some MLP or a CNN.

I do not fully understand how TensorFlow knows from the cost alone that it is indeed the output of a certain NN. A cost function can be defined for any model. How should I "tell" it that a certain cost function derives from a NN?

asked May 26 '17 by Ezer Miller


2 Answers

Question

How should I "tell" tf that a certain cost function derives from a NN?

(short) Answer

This is done by simply configuring your optimizer to minimize (or maximize) a tensor. For example, if I have a loss function like so

loss = tf.reduce_sum( tf.square( y0 - y_out ) )

where y0 is the ground truth (or desired output) and y_out is the calculated output, then I could minimize the loss by defining my training function like so

train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

This tells TensorFlow that when train is computed, it should apply gradient descent on loss to minimize it. Because loss is calculated from y0 and y_out, gradient descent also reaches back into those (if they are trainable variables), and so on.
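If you want to see the two phases separately, minimize() in TF 1.x can be split into compute_gradients() (the backward pass) and apply_gradients() (the update). A minimal sketch reusing the loss above:

optimizer = tf.train.GradientDescentOptimizer(1.0)
# Backward pass: a list of (gradient, variable) pairs, one per trainable variable
grads_and_vars = optimizer.compute_gradients(loss)
# Update step: variable -= learning_rate * gradient
train = optimizer.apply_gradients(grads_and_vars)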

The variables y0, y_out, loss, and train are not standard Python variables but descriptions of a computation graph. TensorFlow uses information about that computation graph to unroll it while applying gradient descent.

Exactly how it does that is beyond the scope of this answer; here and here are two good starting points for more details.

Code Example

Let's walk through a code example. First the code.

### imports
import tensorflow as tf

### constant data
x  = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]

### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output

# Layer 0 = the x2 inputs
x0 = tf.constant( x  , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )

# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
b1 = tf.Variable( tf.random_uniform( [3]   , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )

# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
b2 = tf.Variable( tf.random_uniform( [1]   , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )


### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )

# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)


### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
  sess.run( tf.global_variables_initializer() )
  for step in range(500) :
    sess.run(train)

  results = sess.run([m1,b1,m2,b2,y_out,loss])
  labels  = "m1,b1,m2,b2,y_out,loss".split(",")
  for label, result in zip(labels, results):
    print("")
    print(label)
    print(result)

print("")

Let's go through it, but in reverse order starting with

sess.run(train)

This tells TensorFlow to look up the graph node defined by train and compute it. train is defined as

train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

To compute this, TensorFlow must apply automatic differentiation to loss, which means walking the graph. loss is defined as

loss = tf.reduce_sum( tf.square( y0 - y_out ) )

Here TensorFlow applies automatic differentiation to unroll first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn requires walking the graph for both y0 and y_out.

y0 = tf.constant( y_ , dtype=tf.float32 )

y0 is a constant and will not be updated.

y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )

y_out is processed similarly to loss: first tf.sigmoid is unrolled, then tf.matmul, and so on.

All in all, each operation (such as tf.sigmoid or tf.square) defines not only the forward computation (apply sigmoid or square) but also the information necessary for automatic differentiation. This is different from standard Python math such as

x = 7 + 9

The statement above simply computes a value and assigns it to x, encoding nothing about how it was produced, whereas

z = y0 - y_out

encodes the graph node for subtracting y_out from y0, storing in z both the forward operation and enough information to perform automatic differentiation.
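To make this concrete, you can ask the graph for those gradients directly with tf.gradients. This is an illustrative addition appended to the example above, not part of the original code:

# Symbolic gradients of loss with respect to each trainable variable.
# These are graph nodes too; nothing is computed until sess.run().
grads = tf.gradients(loss, [m1, b1, m2, b2])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for name, g in zip(["dL/dm1", "dL/db1", "dL/dm2", "dL/db2"], sess.run(grads)):
    print(name)
    print(g)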

answered Sep 22 '22 by Anton Codes

Backpropagation was developed by Rumelhart, Hinton et al. and published in Nature in 1986.

As stated in Section 6.5, Back-Propagation and Other Differentiation Algorithms, of the Deep Learning book, there are two approaches to back-propagating gradients through a computational graph: symbol-to-number differentiation and symbol-to-symbol derivatives. The one more relevant to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated with this diagram:

[Fig. 7 from A Tour of TensorFlow: (a) the forward graph w → x → y → z and (b) the same graph extended with symbolic gradient nodes dz/dy, dy/dx, dx/dw]

Source: Section II Part D of A Tour of TensorFlow

On the left side of Fig. 7 above, w represents the weights (or Variables) in TensorFlow, and x and y are two intermediate operations (or nodes; w, x, y, and z are all operations) leading to the scalar loss z.

TensorFlow adds a gradient node for each node in the graph (if we print the names of the variables in a checkpoint we can see some additional variables for such nodes; they are eliminated if we freeze the model into a protocol buffer file for deployment), which can be seen in diagram (b) on the right side: dz/dy, dy/dx, dx/dw.
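One way to see these extra nodes is to list the operations in the graph after the optimizer has been built. This is my own illustrative snippet, not from the paper; it assumes the TF 1.x default of placing ops created by automatic differentiation under a "gradients/" name scope:

import tensorflow as tf

w = tf.Variable(1.0)
loss = tf.square(tf.sin(w))
train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# minimize() has extended the graph with backward-pass ops;
# by default TF 1.x names them under the "gradients/" scope.
for op in tf.get_default_graph().get_operations():
    if op.name.startswith("gradients/"):
        print(op.name)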

During the backward traversal, at each node we multiply its gradient with that of the previous one, finally obtaining a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule. Once the gradient is worked out, w can be updated using the learning rate.
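The symbol-to-symbol idea can be reproduced on a toy graph. The sketch below is my own illustration (not from the paper): it builds w → x → y → z and checks that tf.gradients(z, w) matches the explicit chain-rule product of the per-edge derivatives:

import tensorflow as tf

# Toy graph mirroring Fig. 7: w -> x -> y -> z
w = tf.Variable(2.0)
x = tf.square(w)        # x = w^2,     dx/dw = 2w
y = tf.sin(x)           # y = sin(x),  dy/dx = cos(x)
z = 3.0 * y             # z = 3y,      dz/dy = 3

# Symbol-to-symbol: tf.gradients adds gradient nodes to the graph
dz_dw = tf.gradients(z, w)[0]

# Explicit chain rule built from the per-edge derivatives
chain = tf.gradients(z, y)[0] * tf.gradients(y, x)[0] * tf.gradients(x, w)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([dz_dw, chain]))  # both equal 3 * cos(w^2) * 2w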

For more detailed information, please read this paper: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.

answered Sep 24 '22 by Lerner Zhang