Tensorflow: Linear regression with non-negative constraints

I am trying to implement a linear regression model in Tensorflow, with additional constraints (coming from the domain) that the W and b terms must be non-negative.

I believe there are a couple of ways to do this.

  1. We can modify the cost function to penalize negative weights [Lagrangian approach] (see: TensorFlow - best way to implement weight constraints).
  2. We can compute the gradients ourselves and project the updated parameters onto [0, infinity) [Projected gradient approach].

Approach 1: Lagrangian

When I tried the first approach, I would often end up with a negative b.

I had modified the cost function from:

cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)

to:

cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 100.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint

Keeping the coefficients of nn_w and nn_b very high leads to instability and very high cost. (Note that tf.abs(W) - W is 0 when W is non-negative and 2|W| when W is negative, so each term penalizes only negative values.)

Here is the complete code.

import numpy as np
import tensorflow as tf

n_samples = 50
train_X = np.linspace(1, 50, n_samples)
train_Y = 10*train_X + 6 + 40*np.random.randn(n_samples)

X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)

# Gradient descent
learning_rate=0.0001
# Initializing the variables
init = tf.global_variables_initializer()

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 1.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_with_constraint)

training_epochs=200
with tf.Session() as sess:
    sess.run(init)

    # Fit all training data
    cost_array = np.zeros(training_epochs)
    W_array = np.zeros(training_epochs)
    b_array = np.zeros(training_epochs)

    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        # Record the parameters and cost once per epoch
        W_array[epoch] = sess.run(W)
        b_array[epoch] = sess.run(b)
        cost_array[epoch] = sess.run(cost, feed_dict={X: train_X, Y: train_Y})

The following shows the mean of b for each of 10 different runs.

0   -1.101268
1    0.169225
2    0.158363
3    0.706270
4   -0.371205
5    0.244424
6    1.312516
7   -0.069609
8   -1.032187
9   -1.711668

Clearly, the first approach is not optimal. Further, there is a lot of art involved in choosing the coefficients of the penalty terms.

Approach 2: Projected gradient

I then turned to the second approach, which should offer stronger guarantees.

gr = tf.gradients(cost, [W, b])

We manually compute the gradients and update W and b ourselves:

with tf.Session() as sess:
    sess.run(init)


    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            W_del, b_del = sess.run(gr, feed_dict={X: x, Y: y})
            W = max(0, (W - W_del)*learning_rate) #Project the gradient on [0, infinity]
            b = max(0, (b - b_del)*learning_rate) # Project the gradient on [0, infinity]

This approach seems to be very slow.

I am wondering if there is a better way to run the second approach, or guarantee the results with the first approach. Can we somehow allow the optimizer to ensure that the learnt weights are non-negative?

Edit: How to do this in Autograd

https://github.com/HIPS/autograd/issues/207

asked Mar 28 '17 by Nipun Batra


2 Answers

If you modify your linear model to:

pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))

it will have the same effect as using only positive W and b values.
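
For reference, a minimal sketch of the full reparameterized model, reusing the placeholders and hyperparameters from the question (the effective non-negative parameters are tf.abs(W) and tf.abs(b); the variable names follow the question's code):

W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

# The model only ever sees |W| and |b|, so predictions use non-negative
# parameters by construction, while W and b themselves stay unconstrained
pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# After training, read out the effective parameters:
# W_eff, b_eff = sess.run([tf.abs(W), tf.abs(b)])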

The reason your second approach is slow is that you clip the W and b values outside of the TensorFlow graph. (Also, it will not converge, because (W - W_del)*learning_rate should instead be W - W_del*learning_rate.)

Edit:

You can implement the clipping inside the TensorFlow graph like this:

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)

In this case the W and b values will be clipped to exactly 0, not to small positive numbers.
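
Plugged into the question's linear-regression setup, the training loop is then a sketch along these lines (W, b, cost, and train_step_with_clip as defined above):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            # One gradient step, followed by projection of W and b onto [0, inf)
            sess.run(train_step_with_clip, feed_dict={X: x, Y: y})
    print(sess.run([W, b]))  # both values are guaranteed non-negative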

Here is a small MNIST example with clipping:

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x = tf.placeholder(tf.uint8, [None, 28, 28])
x_vec = tf.cast(tf.reshape(x, [-1, 784]), tf.float32) / 255.

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x_vec, W) + b

y_target = tf.placeholder(tf.uint8, [None])
y_target_one_hot = tf.one_hot(y_target, 10)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_target_one_hot, logits=y))

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
  tf.global_variables_initializer().run()

  for i in range(1000):
    sess.run(train_step_with_clip, feed_dict={
        x: x_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)],
        y_target: y_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)]})

    if not i%100:
      print("Min_W:", sess.run(tf.reduce_min(W)))
      print("Min_b:", sess.run(tf.reduce_min(b)))

  print("Accuracy:", sess.run(accuracy, feed_dict={
      x: x_test,
      y_target: y_test}))
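
Note that the tf.control_dependencies block guarantees that the clipping assignments only run after train_step has applied the gradient update, so each sess.run(train_step_with_clip) performs exactly one projected-gradient step.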

answered Nov 02 '22 by BlueSun


I actually was not able to reproduce your problem of getting a negative b with your first approach. But I do agree that this approach is suboptimal for your use case and can result in negative values.

You should be able to constrain your parameters to non-negative values like so:

W *= tf.cast(W > 0., tf.float32)
b *= tf.cast(b > 0., tf.float32)

(Exchange > with >= if necessary; the cast is needed because the comparison operators produce boolean values.) You would then optimize for the "standard" cost without the additional constraints. However, this does not work in every case; for example, W and b should not be initialized with negative values in the first place.
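
To make the wiring explicit, here is a minimal sketch of this masking in the question's graph (W_var and b_var are hypothetical names for the underlying variables; the masked tensors, not the variables, feed the model):

# Initialize non-negative, as cautioned above
W_var = tf.Variable(abs(np.random.randn()), name="weight")
b_var = tf.Variable(abs(np.random.randn()), name="bias")

# Zero out negative values; only non-negative entries contribute to pred
W = W_var * tf.cast(W_var > 0., tf.float32)
b = b_var * tf.cast(b_var > 0., tf.float32)

pred = tf.add(tf.multiply(X, W), b)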

Your second (and probably better) approach can be accelerated by defining the update logic in the general computational graph, i.e. after the definition of cost:

params = [W, b]
grads = tf.gradients(cost, params)
optimizer = [tf.assign(param, tf.maximum(0., param - grad*learning_rate))
             for param, grad in zip(params, grads)]
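
A sketch of how these assign ops would then be driven from the question's feed loop (optimizer here is the list of assign ops defined above):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            # Each run updates W and b in-graph and clamps them to [0, inf)
            sess.run(optimizer, feed_dict={X: x, Y: y})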

I think your solution is slow because it creates new computation nodes on every iteration, which is probably very costly when repeated many times inside the loops.

Update: using the TensorFlow optimizer

In my solution above, it is not the gradients that are clipped but rather the resulting update values. Along the lines of this answer, you could clip the gradients to be at most the value of the updated parameter, like so:

params = [W, b]
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(cost, params)
# Clip each gradient from above by its variable's current value
clipped_grads_vars = [(tf.clip_by_value(grad, -np.inf, var), var)
                      for grad, var in grads_and_vars]
optimizer = opt.apply_gradients(clipped_grads_vars)

This way an update will never decrease a parameter to a value below 0. However, I think this will not work if the variable is already negative when the update is applied, or if the optimization algorithm somehow multiplies the clipped gradient by a value greater than 1. The latter might actually never happen, but I'm not 100% sure.

answered Nov 02 '22 by vdey