
Gradient Descent vs Stochastic Gradient Descent algorithms

I tried to train a feedforward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

On every epoch I iterated over all the training samples, performing backpropagation for each sample. The runtime is, of course, far too long.

  • Is the algorithm I ran named Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

  • What should I do in order to use Stochastic Gradient Descent? Should I just pick training samples at random and perform backpropagation on each randomly picked sample, instead of the full epochs I currently use?
asked Feb 29 '16 by kuch11


People also ask

Why is SGD used instead of batch gradient descent?

SGD is useful when the dataset is large. Batch Gradient Descent heads directly toward the minimum, but SGD converges faster on large datasets because each update is far cheaper to compute.

Is SGD faster than gradient descent?

Stochastic gradient descent (SGD, or "on-line" gradient descent) typically reaches convergence much faster than batch (or "standard") gradient descent, since it updates the weights more frequently.

What is the advantage of SGD over regular gradient descent?

SGD is stochastic in nature: it picks a "random" instance of the training data at each step and computes the gradient from it, making each step much faster because there is far less data to process at a time, unlike Batch GD.


2 Answers

I'll try to give you some intuition about the problem...

Initially, updates were made as in what you (correctly) call (Batch) Gradient Descent. This ensures that each update to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

(Fig. 1: Gradient Descent)

As datasets grew in size and the computations in each step became more complex, Stochastic Gradient Descent came to be preferred in these cases. Here, the weights are updated as each sample is processed, so subsequent calculations already use "improved" weights. Nonetheless, this very reason causes it to incur some misdirection in minimizing the error function (Fig. 2).

(Fig. 2: Stochastic Gradient Descent)

As such, in many situations it is preferable to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, and the weights are updated much more frequently than in the case of the (original) Gradient Descent.
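For concreteness, here is a minimal sketch of such a mini-batch loop, written in the same Python-style pseudocode as the batch version below and reusing its placeholder calls (neural_network.predict, evaluate_error, neural_network.backpropagate_and_update); batch_size and num_epochs are illustrative assumptions, not values from the question.

import random

batch_size = 32     # assumed, illustrative value
num_epochs = 10     # assumed, illustrative value

for epoch in range(num_epochs):
    random.shuffle(data)                              # visit the samples in a new order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        batch_error = 0
        for sample in batch:
            prediction = neural_network.predict(sample)
            batch_error += evaluate_error(prediction, sample["label"])
        # one weight update per mini-batch, using the error accumulated over that batch
        neural_network.backpropagate_and_update(batch_error)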

[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    # evaluate_error may be as simple as abs(prediction - sample["label"])
    sample_error = evaluate_error(prediction, sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

(In the case of multi-class labeling, error is an array holding the error for each label.)

This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, the call to neural_network.backpropagate_and_update() is instead made inside the for loop, with the single sample's error as its argument.
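For completeness, a hedged sketch of that stochastic variant, again using the same placeholder calls (num_epochs is an assumed stopping criterion, not something specified in the question):

for epoch in range(num_epochs):
    # ideally, shuffle the data at the start of each epoch
    for sample in data:
        prediction = neural_network.predict(sample)
        sample_error = evaluate_error(prediction, sample["label"])
        # one weight update per sample, using only that sample's error
        neural_network.backpropagate_and_update(sample_error)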

answered Oct 11 '22 by Diogo Pinto


The new scenario you describe (performing Backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

The three most common flavors according to that document are listed below (your flavor is C; a small sketch of C follows the list):

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
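As a concrete, hedged illustration of flavor C, here is a minimal Python sketch; data, neural_network, and evaluate_error are the same placeholders used in the other answer, and max_iterations is an assumed name for the stopping criterion:

import random

max_iterations = 100000   # assumed stopping criterion; one could also stop near an approximate cost minimum

for t in range(max_iterations):
    sample = random.choice(data)                      # draw a random training sample (with replacement)
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])
    neural_network.backpropagate_and_update(sample_error)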
answered Oct 11 '22 by SomethingSomething