I've taught myself machine learning with some online resources but I have a question about gradient descent that I couldn't figure out.
For logistic regression, the gradient descent update rule is:
Repeat {
θj := θj − (α/m)∑(hθ(x^(i)) − y^(i))xj^(i)
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x^(i)) is the hypothesis for the i-th training example; y^(i) is its actual value; xj^(i) is the value of variable j for that example; and m is the number of training examples. The summation runs over the training examples i = 1, …, m.
This is where I get confused.
It's not clear to me whether the summation runs over my entire training set, or only over the examples I have processed up to that point.
For example, imagine I have 10 training examples. If I perform a gradient descent update after each training example, my coefficients will end up very different than if I performed a single update after all 10 training examples.
See below how the First Way differs from the Second Way:
First way
Second way
I hope that explains my confusion. Does anyone know which way is correct?
Edit: Adding cost function and hypothesis function
cost function = −(1/m)∑[y log(hθ(x)) + (1 − y) log(1 − hθ(x))]
hθ(x) = 1/(1 + e^(−z))
and z = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn
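In code, those two definitions might look like the following NumPy sketch (the function names `h` and `cost` are my own; I'm assuming `X` carries a leading column of 1s so that θ0 acts as the intercept):

```python
import numpy as np

def h(theta, X):
    """Hypothesis h_theta(x) = 1 / (1 + e^(-z)), with z = theta . x per row of X."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def cost(theta, X, y):
    """Logistic-regression cost: -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(y)
    p = h(theta, X)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / m
```

With θ = 0 every prediction is 0.5, so the cost starts at log 2 ≈ 0.693 regardless of the data.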
The second way you describe is the correct way to perform gradient descent. The true gradient depends on the whole data set, so one iteration of gradient descent requires using every training example. (This is true for any learning algorithm whose objective you can take the gradient of.)
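A minimal sketch of this "second way" (batch gradient descent) in NumPy; the setup is my own illustration, and `X` is again assumed to have a leading column of 1s:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=5000):
    """One theta update per pass, using the gradient over ALL m examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for every example at once
        grad = X.T @ (p - y) / m               # (1/m) * sum over i of (h - y) * x_j
        theta -= alpha * grad                  # simultaneous update of every theta_j
    return theta
```

Note that all θj are updated together from one gradient, which is exactly what the summation in the question expresses.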
The "first way" is close to something called Stochastic Gradient Descent (SGD). The idea is that using the whole data set for one update can be overkill, especially if some of the data points are redundant. Instead, we pick a random point from the data set (essentially setting m = 1) and update based on successive selections of single points. This way we can do m updates at about the same cost as one update of full gradient descent, but each update is noisy, which can make convergence to the final solution difficult.
The compromise between these approaches is called mini-batch gradient descent. Taking the gradient of the whole data set is one full round of "batch" processing, as we need the whole data set on hand. Instead, we use a mini-batch: a small subset of the whole data set. We choose k with 1 < k < m, where k is the number of points in the mini-batch, select k random data points to compute the gradient from at every iteration, and then perform the update. Repeat until convergence. Increasing or decreasing k trades speed against the accuracy of the gradient estimate.
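A mini-batch sketch along the same lines, where `k` is the batch size (one hypothetical choice of defaults, not a recommendation):

```python
import numpy as np

def minibatch_gd(X, y, k=2, alpha=0.1, epochs=200, seed=0):
    """Each update uses a random mini-batch of k points, 1 < k < m."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                      # reshuffle every epoch
        for start in range(0, m, k):
            b = idx[start:start + k]                  # indices of this mini-batch
            p = 1.0 / (1.0 + np.exp(-X[b] @ theta))
            theta -= alpha * X[b].T @ (p - y[b]) / len(b)
    return theta
```

Setting k = m recovers batch gradient descent, and k = 1 recovers SGD.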
Note: for both stochastic and mini-batch gradient descent, it is important to shuffle the data, i.e. to select the next data point randomly. If you visit the data points in the same fixed order on every pass, you can get strange or bad results, often diverging away from the solution.