I tried to train a FeedForward Neural Network on the MNIST Handwritten Digits dataset (includes 60K training samples).
I each time iterated over all the training samples, performing Backpropagation for each such sample on every epoch. The runtime is of course too long.
I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.
SGD can be used when the dataset is large. Batch Gradient Descent converges directly to minima. SGD converges faster for larger datasets.
Stochastic gradient descent (SGD or "on-line") typically reaches convergence much faster than batch (or "standard") gradient descent since it updates weight more frequently.
SGD is stochastic in nature i.e. it picks up a “random” instance of training data at each step and then computes the gradient, making it much faster as there is much fewer data to manipulate at a single time, unlike Batch GD.
I'll try to give you some intuition over the problem...
Initially, updates were made in what you (correctly) call (Batch) Gradient Descent. This assures that each update in the weights is done in the "right" direction (Fig. 1): the one that minimizes the cost function.
With the growth of datasets size, and complexier computations in each step, Stochastic Gradient Descent came to be preferred in these cases. Here, updates to the weights are done as each sample is processed and, as such, subsequent calculations already use "improved" weights. Nonetheless, this very reason leads to it incurring in some misdirection in minimizing the error function (Fig. 2).
As such, in many situations it is preferred to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, but is updated much more regularly than in the case of the (original) Gradient Descent.
[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:
error = 0
for sample in data:
prediction = neural_network.predict(sample)
sample_error = evaluate_error(prediction, sample["label"]) # may be as simple as
# module(prediction - sample["label"])
error += sample_error
neural_network.backpropagate_and_update(error)
(In the case of multi-class labeling, error represents an array of the error for each label.)
This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, the call to neural_network.backpropagate_and_update() is called inside the for cycle, with the sample error as argument.
The new scenario you describe (performing Backpropagation on each randomly picked sample), is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
The 3 most common flavors according to this document are (Your flavor is C):
A)
randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
for training sample i:
compute gradients and perform weight updates
B)
for one or more epochs, or until approx. cost minimum is reached:
randomly shuffle samples in the training set
for training sample i:
compute gradients and perform weight updates
C)
for iterations t, or until approx. cost minimum is reached:
draw random sample from the training set
compute gradients and perform weight updates
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With