
Full gradient descent in keras

I am trying to implement full gradient descent in Keras. This means that for each epoch I train on the entire dataset, which is why the batch size is set to the size of the training set.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD,Adam
from keras import regularizers
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  
import random
from numpy.random import seed

def xrange(start_point,end_point,N,base):
    temp = np.logspace(0.1, 1, N,base=base,endpoint=False)
    temp=temp-temp.min()
    temp=(0.0+temp)/(0.0+temp.max()) #this is between 0 and 1
    return (end_point-start_point)*temp +start_point #this is the range

def train_model(x_train,y_train,x_test):
    #seed(1)
    model=Sequential()
    num_units=100
    act='relu'
    model.add(Dense(num_units,input_shape=(1,),activation=act)) 
    model.add(Dense(num_units,activation=act))
    model.add(Dense(num_units,activation=act))
    model.add(Dense(num_units,activation=act))
    model.add(Dense(1,activation='tanh')) #output layer 1 unit ; activation='tanh'
    model.compile(Adam(),'mean_squared_error',metrics=['mse'])
    history=model.fit(x_train,y_train,batch_size=len(x_train),epochs=500,verbose=0,validation_split=0.2) #full-batch training on the noisy targets
    fit=model.predict(x_test)
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    return fit

N = 1024
start_point=-5.25
end_point=5.25
base=500 # the base of the log spacing of the training points
train_step=0.0007
x_test=np.arange(start_point,end_point,train_step+0.05)

x_train=xrange(start_point,end_point,N,base)
#random.shuffle(x_train)

function_y=np.sin(3*x_train)/2
noise=np.random.uniform(-0.2,0.2,len(function_y))
y_train=function_y+noise
fit=train_model(x_train,y_train,x_test)

plt.scatter(x_train,y_train, facecolors='none', edgecolors='g') #training data (green)
plt.scatter(x_test, fit, facecolors='none', edgecolors='b') #model predictions (blue)

(plot: training data as green circles, model predictions as blue circles, without shuffling)

However, when I uncomment #random.shuffle(x_train) in order to shuffle the training data, I get a different plot:

I don't understand why I get different plots (the green circles are the training data and the blue circles are what the model learned). In both cases the batch contains the entire dataset, so shuffling shouldn't change anything.
Thank you.

Ariel

asked Dec 13 '18 by user552231




1 Answer

This happens for two reasons:

  • First, when the data is not shuffled, the train/validation split is inappropriate.
  • Second, full gradient descent performs a single update per epoch, so more training epochs might be required to converge (see the sketch right after this list).
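
To make the second point concrete, here is a minimal sketch (not part of the original answer) that counts how many gradient updates Keras performs per epoch. It assumes the model, x_train and y_train defined in the question; with batch_size=len(x_train) it reports exactly one update per epoch:

from keras.callbacks import Callback

class UpdateCounter(Callback):
    """Counts how many gradient updates (batches) are run in each epoch."""
    def on_epoch_begin(self, epoch, logs=None):
        self.updates = 0
    def on_batch_end(self, batch, logs=None):
        self.updates += 1
    def on_epoch_end(self, epoch, logs=None):
        print('epoch %d: %d gradient update(s)' % (epoch, self.updates))

model.fit(x_train, y_train, batch_size=len(x_train), epochs=5,
          verbose=0, callbacks=[UpdateCounter()])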

Why doesn't your model match the wave?

From model.fit:

  • validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

Which means that your validation set consists of the last 20% training samples. Because you are using a log scale for your independent variable (x_train), it turns out that your train/validation split is:

split_point = int(0.2*N)               # validation_split=0.2 -> last 20% of the samples
x_val = x_train[-split_point:]
y_val = y_train[-split_point:]
x_train_ = x_train[:-split_point]
y_train_ = y_train[:-split_point]
plt.scatter(x_train_, y_train_, c='g') # training data (green)
plt.scatter(x_val, y_val, c='r')       # validation data (red)
plt.show()

Train - validation split

In the previous plot, training and validation data are represented by green and red points, respectively. Note that your training dataset is not representative of the whole population.
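
If you want a representative split while keeping validation_split=0.2, one option (a sketch, not part of the original answer) is to shuffle the inputs and targets with the same permutation before calling fit:

# Illustrative sketch: apply one random permutation to both x and y so that
# validation_split=0.2 selects a representative 20% of the data.
perm = np.random.permutation(len(x_train))
x_train_shuffled = x_train[perm]
y_train_shuffled = y_train[perm]

history = model.fit(x_train_shuffled, y_train_shuffled,
                    batch_size=len(x_train_shuffled), epochs=500,
                    verbose=0, validation_split=0.2)

Uncommenting random.shuffle(x_train) in the question has the same effect, because y_train is computed from x_train after the shuffle, so the input/target pairing is preserved.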


Why does it still not match the training dataset?

In addition to the inappropriate train/validation split, full gradient descent might require more training epochs to converge (its gradients are less noisy, but it performs only a single gradient update per epoch). If, instead, you train your model for ~1500 epochs (or use mini-batch gradient descent with a batch size of, say, 32), you end up getting:

Result
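
For reference, the two alternatives above only change the fit call inside train_model, roughly as follows (illustrative values, not the exact code behind the plot above):

# Option 1: keep full-batch gradient descent but train for more epochs.
history = model.fit(x_train, y_train, batch_size=len(x_train), epochs=1500,
                    verbose=0, validation_split=0.2)

# Option 2: switch to mini-batch gradient descent (e.g. batch_size=32).
history = model.fit(x_train, y_train, batch_size=32, epochs=500,
                    verbose=0, validation_split=0.2)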

answered Oct 18 '22 by rvinas