 

How to implement multivariate linear stochastic gradient descent algorithm in tensorflow?

I started with a simple implementation of single-variable linear gradient descent, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.

Single-variable linear regression

import tensorflow as tf
import numpy as np

# create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5

# Find values for W that compute y_data = W * x_data 
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in range(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))

asked Mar 16 '16 by NEW USER

1 Answer

Your question has two parts:

  • How to move this problem to a higher-dimensional space.
  • How to change from batch gradient descent to stochastic gradient descent.

To get a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimensions of your Variable W to match those of w, and replace the multiplication W * x_data with the matrix product tf.matmul(x_data, W); your code should then run just fine.
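For instance, a minimal sketch of just those two changes, using the same feature dimension d as the full example below (x_data is assumed here to be a float32 array of shape (N, d)):

d = 10  # number of features, matching the example below

# W is now a column vector in R^d instead of a scalar
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))

# computes <x, W> for every row of x_data at once, giving a (N, 1) output
y = tf.matmul(x_data, W)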

To change the learning method to stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder.
Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then you call your training step, feeding it the proper mini-batch of your data.

Here is an example of how you could implement this behavior; it should show that W quickly converges to w.

import tensorflow as tf
import numpy as np

# Define dimensions
d = 10     # Size of the parameter space
N = 1000   # Number of data sample

# create random data
w = .5*np.ones(d)
x_data = np.random.random((N, d)).astype(np.float32)
y_data = x_data.dot(w).reshape((-1, 1))

# Define placeholders to feed mini_batches
X = tf.placeholder(tf.float32, shape=[None, d], name='X')
y_ = tf.placeholder(tf.float32, shape=[None, 1], name='y')

# Find values for W that compute y_data = <x, W>
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(X, W, name='y_pred')

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y_ - y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the model using mini-batches.
mini_batch_size = 100
n_batch = N // mini_batch_size + (N % mini_batch_size != 0)
for step in range(2001):
    i_batch = (step % n_batch)*mini_batch_size
    batch = x_data[i_batch:i_batch+mini_batch_size], y_data[i_batch:i_batch+mini_batch_size]
    sess.run(train, feed_dict={X: batch[0], y_: batch[1]})
    if step % 200 == 0:
        print(step, sess.run(W))

Two side notes:

  • The implementation above is called mini-batch gradient descent because at each step the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is usually used to stabilize the gradient estimate at each step. Plain stochastic gradient descent can be recovered by setting mini_batch_size = 1.

  • The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting; a sketch of this is shown after these notes. Some recent work also considers using only a single pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, see Bottou (2012). This can easily be changed according to your problem setup and the statistical properties you are looking for.
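As a rough sketch of that shuffling idea, layered on top of the training loop above (it reuses sess, train, X, y_, x_data, y_data, N and mini_batch_size from the example; n_epochs and the use of np.random.permutation are assumptions for illustration, not part of the original answer):

n_epochs = 20  # assumed number of passes over the data
for epoch in range(n_epochs):
    # draw a fresh random ordering of the samples at every epoch
    perm = np.random.permutation(N)
    x_shuffled, y_shuffled = x_data[perm], y_data[perm]
    # sweep over the shuffled data in consecutive mini-batches
    for i_batch in range(0, N, mini_batch_size):
        sess.run(train, feed_dict={X: x_shuffled[i_batch:i_batch + mini_batch_size],
                                   y_: y_shuffled[i_batch:i_batch + mini_batch_size]})
    print(epoch, sess.run(W))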

answered by Thomas Moreau