I'm trying to write a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first. I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.
I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment I'm only trying to get it to converge to y=ax+b, where a=1=theta[1] and b=0=theta[0].
The training set consists of the arrays x and y, where (x[i],y[i]) is a point.
void train()
{
    double delta;
    for (int i = 0; i < x.size(); i++)
    {
        delta = y[i]-hypothesis(x[i]);
        theta[1] += alpha*delta*x[i];
        theta[0] += alpha*delta*1;
    }
}

void C_Approx::display()
{
    std::cout<<theta[1]<<"x + "<<theta[0]<<" \t "<<"f(x)="<<hypothesis(1)<<std::endl;
}
Some of the results I'm getting (I input a number, it runs train() a few times, then display()):
1
0.33616x + 0.33616 f(x)=0.67232
1
0.482408x + 0.482408 f(x)=0.964816
1
0.499381x + 0.499381 f(x)=0.998762
1
0.499993x + 0.499993 f(x)=0.999986
1
0.5x + 0.5 f(x)=1
An example of it diverging after it passed 8:
1
0.33616x + 0.33616 f(x)=0.67232
2
0.705508x + 0.509914 f(x)=1.21542
3
0.850024x + 0.449928 f(x)=1.29995
4
0.936062x + 0.330346 f(x)=1.26641
5
0.951346x + 0.231295 f(x)=1.18264
6
0.992876x + 0.137739 f(x)=1.13062
7
0.932206x + 0.127372 f(x)=1.05958
8
1.00077x + 0.000493063 f(x)=1.00126
9
-0.689325x + -0.0714712 f(x)=-0.760797
10
4.10321e+08x + 4.365e+07 f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21 f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40 f(x)=-4.27216e+41
I tried the solution proposed here of scaling the step and ended up with similar results. What am I doing wrong?
If the learning rate is too small, each step is small, so convergence is slow and may not be reached within the iterations you run. If the learning rate is too large, gradient descent overshoots the minimum and ultimately fails to converge.
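To see both failure modes side by side, here is a minimal sketch that uses the same per-sample update as train() in the question. The function name final_error and its parameters are made up for illustration; it starts from theta = (0, 0) and returns the mean squared error after a fixed number of passes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): run `epochs` passes of the
// per-sample update over the training set, starting from theta = (0, 0),
// and return the mean squared error of the fit y = theta1*x + theta0.
double final_error(const std::vector<double>& x, const std::vector<double>& y,
                   double alpha, int epochs)
{
    double theta0 = 0.0;
    double theta1 = 0.0;
    for (int e = 0; e < epochs; ++e) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            // Same rule as train(): move theta along the sample's error.
            const double delta = y[i] - (theta1 * x[i] + theta0);
            theta1 += alpha * delta * x[i];
            theta0 += alpha * delta;
        }
    }
    // Measure how well the final parameters fit the training set.
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double r = y[i] - (theta1 * x[i] + theta0);
        sum += r * r;
    }
    return sum / static_cast<double>(x.size());
}
```

On the points (1,1), (2,2), (3,3), a rate of 1e-6 barely moves the error after 100 passes, 0.05 drives it to essentially zero, and 0.5 makes it blow up within a few dozen passes.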
For a convex function whose gradient is Lipschitz continuous, gradient descent with a sufficiently small fixed step size is guaranteed to converge, and it does so at rate O(1/k). The objective value strictly decreases with each iteration until it reaches the optimal value f(x) = f(x∗).
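For reference, the O(1/k) rate can be written out explicitly. This is the standard textbook bound, stated under assumptions that are not in the thread: f is convex and differentiable, its gradient is Lipschitz with constant L, the step size t is fixed with t ≤ 1/L, and x^(k) is the k-th iterate.

```latex
f\bigl(x^{(k)}\bigr) - f(x^{*}) \;\le\; \frac{\lVert x^{(0)} - x^{*} \rVert_2^{2}}{2\,t\,k},
\qquad 0 < t \le \frac{1}{L}
```

The bound only holds for small enough t, which is why a learning rate that is too large loses the guarantee.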
Gradient descent can reduce the cost function, and it converges when it reaches a point where the gradient of the cost function is zero.
No, gradient descent methods do not always converge to the global optimum. In some cases they settle at a local minimum instead.
Your implementation is good. Generally, stochastic gradient descent might diverge when α is too large. What you would do with a large dataset is take a reasonably sized random sample, find α that gives you the best results, and then use it for the rest.
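A rough sketch of that procedure, assuming a fixed list of candidate rates and a simple shuffle-and-truncate sample. Point, trial_error, pick_alpha, and the seed parameter are all names invented here, not anything from the question.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

struct Point { double x; double y; };  // hypothetical training-point type

// Mean squared error after `epochs` passes of the per-sample update.
double trial_error(const std::vector<Point>& pts, double alpha, int epochs)
{
    double theta0 = 0.0;
    double theta1 = 0.0;
    for (int e = 0; e < epochs; ++e) {
        for (const Point& p : pts) {
            const double delta = p.y - (theta1 * p.x + theta0);
            theta1 += alpha * delta * p.x;
            theta0 += alpha * delta;
        }
    }
    double sum = 0.0;
    for (const Point& p : pts) {
        const double r = p.y - (theta1 * p.x + theta0);
        sum += r * r;
    }
    return sum / static_cast<double>(pts.size());
}

// Take a random sample of at most n points, try each candidate rate on it,
// and return the one with the lowest finite error (NaN if none is finite).
double pick_alpha(const std::vector<Point>& data, std::size_t n,
                  const std::vector<double>& candidates, int epochs,
                  unsigned seed)
{
    std::vector<Point> sample = data;
    std::mt19937 rng(seed);
    std::shuffle(sample.begin(), sample.end(), rng);
    sample.resize(std::min(n, sample.size()));

    double best_alpha = std::numeric_limits<double>::quiet_NaN();
    double best_err = std::numeric_limits<double>::infinity();
    for (double a : candidates) {
        const double err = trial_error(sample, a, epochs);
        // Diverged runs end up as inf or NaN and are skipped here.
        if (std::isfinite(err) && err < best_err) {
            best_err = err;
            best_alpha = a;
        }
    }
    return best_alpha;
}
```

With the points (1,1) through (12,12) and candidates {1.0, 0.001}, the rate 1.0 diverges on any sample and 0.001 is returned. In practice you would try a wider grid, for example powers of ten.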
I have experienced the same problem (albeit in Java) because my learning rate was too big.
In short, I was using α = 0.001 and I had to push it to 0.000001 to see actual convergence.
Of course these values are linked to your dataset.