
Gradient Descent vs Stochastic Gradient Descent algorithms

I tried to train a feedforward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

On every epoch I iterated over all the training samples, performing backpropagation for each sample. The runtime is, of course, far too long.

  • Is the algorithm I ran named Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

  • What should I do in order to use Stochastic Gradient Descent? Should I just pick training samples at random and perform backpropagation on each randomly picked sample, instead of the full epochs I currently use?
asked Feb 29 '16 by kuch11


People also ask

Why is SGD used instead of batch gradient descent?

SGD is useful when the dataset is large. Batch Gradient Descent heads directly toward the minimum, but SGD converges faster on large datasets because each update is far cheaper to compute.

Is SGD faster than gradient descent?

Stochastic gradient descent (SGD, or "on-line" gradient descent) typically reaches convergence much faster than batch (or "standard") gradient descent, since it updates the weights more frequently.

What is the advantage of SGD over regular gradient descent?

SGD is stochastic in nature: it picks a "random" instance of the training data at each step and computes the gradient from it, making each step much faster because there is far less data to process at a time, unlike Batch GD.


2 Answers

I'll try to give you some intuition about the problem...

Initially, updates were made as in what you (correctly) call (Batch) Gradient Descent. This ensures that each update to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

(Fig. 1: Gradient Descent)

As datasets grew in size and the computations in each step became more complex, Stochastic Gradient Descent came to be preferred in these cases. Here, the weights are updated as each sample is processed, so subsequent calculations already use "improved" weights. Nonetheless, this very reason causes it to incur some misdirection in minimizing the error function (Fig. 2).

(Fig. 2: Stochastic Gradient Descent)

As such, in many situations it is preferable to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, and the weights are updated much more frequently than in the case of the (original) Gradient Descent.
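For concreteness, here is a minimal sketch of such a mini-batch loop, written in the same Python-style pseudocode as the batch version below and reusing its placeholder calls (neural_network.predict, evaluate_error, neural_network.backpropagate_and_update); batch_size and num_epochs are illustrative assumptions, not values from the question.

import random

batch_size = 32     # assumed, illustrative value
num_epochs = 10     # assumed, illustrative value

for epoch in range(num_epochs):
    random.shuffle(data)                              # visit the samples in a new order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        batch_error = 0
        for sample in batch:
            prediction = neural_network.predict(sample)
            batch_error += evaluate_error(prediction, sample["label"])
        # one weight update per mini-batch, using the error accumulated over that batch
        neural_network.backpropagate_and_update(batch_error)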

[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    # evaluate_error may be as simple as abs(prediction - sample["label"])
    sample_error = evaluate_error(prediction, sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

(In the case of multi-class labeling, error is an array holding the error for each label.)

This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, the call to neural_network.backpropagate_and_update() is instead made inside the for loop, with the single sample's error as its argument.
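For completeness, a hedged sketch of that stochastic variant, again using the same placeholder calls (num_epochs is an assumed stopping criterion, not something specified in the question):

for epoch in range(num_epochs):
    # ideally, shuffle the data at the start of each epoch
    for sample in data:
        prediction = neural_network.predict(sample)
        sample_error = evaluate_error(prediction, sample["label"])
        # one weight update per sample, using only that sample's error
        neural_network.backpropagate_and_update(sample_error)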

answered Oct 11 '22 by Diogo Pinto


The new scenario you describe (performing Backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

The three most common flavors according to that document are listed below (your flavor is C; a small sketch of C follows the list):

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
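As a concrete, hedged illustration of flavor C, here is a minimal Python sketch; data, neural_network, and evaluate_error are the same placeholders used in the other answer, and max_iterations is an assumed name for the stopping criterion:

import random

max_iterations = 100000   # assumed stopping criterion; one could also stop near an approximate cost minimum

for t in range(max_iterations):
    sample = random.choice(data)                      # draw a random training sample (with replacement)
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])
    neural_network.backpropagate_and_update(sample_error)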
answered Oct 11 '22 by SomethingSomething