I'm trying to do Stanford's CS20: TensorFlow for Deep Learning Research course. The first two lectures provide a good introduction to the low-level plumbing and computation framework (which, frankly, the official introductory tutorials seem to skip right over for reasons I can only fathom as sadism). In lecture 3, it starts performing a linear regression and makes what seems like a fairly heavy cognitive leap for me. Instead of calling session.run on a tensor computation, it calls it on the GradientDescentOptimizer:
sess.run(optimizer, feed_dict={X: x, Y:y})
The full code is available on page 3 of the lecture 3 notes.
EDIT: code and data also available at this github - code is available in examples/03_linreg_placeholder.py
and data in examples/data/birth_life_2010.txt
EDIT: code is below as per request
import tensorflow as tf
import utils
DATA_FILE = "data/birth_life_2010.txt"
# Step 1: read in data from the .txt file
# data is a numpy array of shape (190, 2), each row is a datapoint
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: construct model to predict Y (life expectancy from birth rate)
Y_predicted = w * X + b
# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())

    # Step 8: train the model
    for i in range(100): # run 100 epochs
        for x, y in data:
            # Session runs train_op to minimize loss
            sess.run(optimizer, feed_dict={X: x, Y: y})

    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
I've done the Coursera machine learning course, so I (think I) understand the notion of gradient descent. But I'm quite lost as to what is happening in this specific case.
What I would expect to happen:
I understand that in practice you'd apply things like batching and subsets, but in this case I believe this simply loops over the entire dataset 100 times.
I can (and have) implemented this myself before, but I'm struggling to fathom how the code above achieves it. For one thing, the optimizer is called on each data point (i.e. it sits in an inner loop over every data point, inside the loop over 100 epochs). I would have expected an optimization call which took in the entire dataset, something like the sketch below.
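Roughly this is what I had in mind - a rough sketch with made-up synthetic data and hypothetical names (not the course code), where every optimization step sees the whole dataset at once:

import numpy as np
import tensorflow as tf

# stand-in data with the same flavour as birth_life_2010.txt (hypothetical)
x_all = np.random.rand(190).astype(np.float32)
y_all = (70.0 - 5.0 * x_all + 0.5 * np.random.randn(190)).astype(np.float32)

X = tf.placeholder(tf.float32, shape=[None], name='X_all')
Y = tf.placeholder(tf.float32, shape=[None], name='Y_all')
w = tf.get_variable('w_batch', initializer=tf.constant(0.0))
b = tf.get_variable('b_batch', initializer=tf.constant(0.0))

Y_predicted = w * X + b
loss = tf.reduce_mean(tf.square(Y - Y_predicted))  # average over the whole dataset
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(100):  # 100 epochs, one parameter update per epoch
        _, l = sess.run([optimizer, loss], feed_dict={X: x_all, Y: y_all})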
Question 1 - is the gradient adjustment operating over the entire data set 100 times, or over the entire data set 100 times in batches of 1 (so 100*n times, for n examples)?
Question 2 - how does the optimizer 'know' how to adjust w and b? It's only provided the loss tensor - is it reading back through the graph and just going "well, w and b are the only variables, so I'll wiggle the hell out of those"?
Question 2b - if so, what happens if you put in other variables? Or more complex functions? Does it just auto-magically calculate the gradient adjustment for every variable in the predecessor graph?
Question 2c - pursuant to that, I've tried adjusting to a quadratic expression as suggested on page 3 of the tutorial, but I end up getting a higher loss. Is this normal? The tutorial seems to suggest it should be better; at the least I would expect it not to be worse. Is this just a matter of changing hyperparameters?
EDIT: Full code for my attempt at the quadratic is below. Note that this is the same as the above, with lines 28, 29, 30 and 34 modified to use a quadratic predictor. These edits are (what I interpret to be) what's suggested in the lecture 3 notes on page 4.
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen ([email protected])
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import utils
DATA_FILE = 'data/birth_life_2010.txt'
# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
# w = tf.get_variable('weights', initializer=tf.constant(0.0)) old single weight
w = tf.get_variable('weights_1', initializer=tf.constant(0.0))
u = tf.get_variable('weights_2', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: build model to predict Y
#Y_predicted = w * X + b #linear
Y_predicted = w * X * X + X * u + b #quadratic
#Y_predicted = w # test of nonsense
# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
#loss = utils.huber_loss(Y, Y_predicted)
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w, u and b
    sess.run(tf.global_variables_initializer())

    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Session executes optimizer and fetches the value of loss
            _, l = sess.run([optimizer, loss], feed_dict={X: x, Y: y})
            total_loss += l
        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    writer.close()

    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
print('Took: %f seconds' %(time.time() - start))
print(f'w = {w_out}')
# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
plt.legend()
plt.show()
For the linear predictor I get a loss of (this aligns with the lecture notes):
Epoch 99: 30.03552558278714
For my attempt at the quadratic I get a loss of:
Epoch 99: 127.2992221294363
Question 1: it's the latter - the gradient step is taken over batches of 1 (so 100*n updates for n examples), since each sess.run call is fed a single example (each element of data is a single input). I.e. compute the gradient of the loss with respect to a single example, update the parameters, go to the next example... until you went over the whole dataset. Do this 100 times.
Question 2: the "magic" happens in the minimize call of the optimizer. Indeed, you only put in the cost: under the hood, Tensorflow will then compute gradients for all requested variables (we'll get to that in a second) that are involved in the cost computation (it can infer this from the computational graph) and return an op that "applies" the gradients. This means an op that takes all the requested variables and assigns a new value to them, something like tf.assign(var, var - learning_rate*gradient).
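If it helps to see this less magically: minimize is roughly a shortcut for compute_gradients followed by apply_gradients. Here is a small self-contained sketch (a toy cost and a hypothetical variable name, not your model) of the two-step version; running the returned op does one gradient step per run.

import tensorflow as tf

x = tf.placeholder(tf.float32)
v = tf.get_variable('v_demo', initializer=tf.constant(0.0))  # hypothetical toy variable
cost = tf.square(x - v)                                      # toy cost, minimized when v == x

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(cost)    # list of (d cost / d var, var) pairs
train_op = opt.apply_gradients(grads_and_vars)  # op that does var <- var - lr * grad

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        sess.run(train_op, feed_dict={x: 3.0})  # each run is one gradient step
    print(sess.run(v))                          # v has moved from 0.0 toward 3.0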
This is related to another question you asked: minimize returns just an op, and by itself this doesn't do anything! Running this op in a session will do a "gradient step" each time.
As to which variables are actually affected by this op: you can give this as an argument to the minimize call - the argument is var_list. If this is not given, Tensorflow will simply use all "trainable variables". By default, any variable you create with tf.Variable or tf.get_variable is trainable. However, you can pass trainable=False to these functions to create variables that are not (by default) going to be affected by the op returned by minimize. Play around with this! See what happens if you set some variables not to be trainable, or if you pass a custom var_list to minimize.
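To make that concrete, here is a tiny sketch (toy cost, hypothetical variable names) of both options: one variable is created with trainable=False, and the minimize call is also restricted with var_list, so only the other variable gets updated.

import tensorflow as tf

x = tf.placeholder(tf.float32)
a = tf.get_variable('a_demo', initializer=tf.constant(0.0))                   # trainable (default)
c = tf.get_variable('c_demo', initializer=tf.constant(1.0), trainable=False)  # frozen
cost = tf.square(x - (a + c))

# Only a is updated: c is excluded both because it is not trainable
# and because var_list explicitly restricts the update to [a].
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(cost, var_list=[a])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(train_op, feed_dict={x: 3.0})
    print(sess.run([a, c]))  # a has moved toward 2.0, c is still 1.0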
In general, the "whole idea" of Tensorflow is that it can "magically" calculate gradients based on only a feedforward description of the model.
EDIT: This is possible because machine learning models (including deep learning) are composed of quite simple building blocks such as matrix multiplications and mostly pointwise nonlinearities. These simple blocks also have simple derivatives, which can be composed via the chain rule. You might want to read up on the backpropagation algorithm.
It will certainly take longer with very large models. But it is always possible as long as there is a clear "path" through the computation graph where all components have defined derivatives.
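You can also ask for those gradients directly with tf.gradients, which is (roughly) what minimize relies on under the hood before applying the updates. A minimal sketch (hypothetical names, a toy quadratic model):

import tensorflow as tf

X = tf.placeholder(tf.float32)
w = tf.get_variable('w_grad_demo', initializer=tf.constant(2.0))
b = tf.get_variable('b_grad_demo', initializer=tf.constant(1.0))
y = w * X * X + b                # a purely feedforward description of a toy model

grads = tf.gradients(y, [w, b])  # symbolic gradients: dy/dw = X*X, dy/db = 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grads, feed_dict={X: 3.0}))  # [9.0, 1.0]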
As to whether this can generate poor models: Yes, and this is a fundamental problem of deep learning. Very complex/deep models lead to highly non-convex cost functions which are difficult to optimize with methods like gradient descent.
With regards to the quadratic function: Looks like there are two problems here.