I've implemented both batch and stochastic gradient descent, but I'm running into some issues. This is my stochastic update rule:
repeat {
    for i = 1 to m {
        theta(j) := theta(j) - step * (derivative of cost for example i)   (for all j)
    }
}
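For concreteness, in Python the loop looks roughly like this (sketched here for a linear hypothesis with squared-error cost; the gradient term would differ for another model):

    import numpy as np

    def sgd(X, y, step=0.01, epochs=100):
        """Stochastic gradient descent: one parameter update per training example."""
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):                  # repeat
            for i in range(m):                   # for i = 1 to m
                # derivative of the squared-error cost for example i
                grad = (X[i] @ theta - y[i]) * X[i]
                theta -= step * grad             # theta(j) := theta(j) - step * derivative
        return theta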
The issue I have is that, even though the cost function keeps getting smaller, the test results say the model is not good. If I change the step size a bit and change the number of iterations, the cost function ends up a little larger in value, but the test results are OK. Is this an overfitting "symptom"? How do I know which model is the right one? :)
As I said: even though the cost function is minimized further, the model does worse on the test set.
Gradient descent is a local search method for minimizing a function. When it reaches a local minimum in the parameter space, it won't be able to go any further. This makes gradient descent (and other local methods) prone to getting stuck in local minima, rather than reaching the global minimum. The local minima may or may not be good solutions for what you're trying to achieve. What to expect will depend on the function that you're trying to minimize.
In particular, high-dimensional NP-complete problems can be tricky. They often have exponentially many local optima, with many of them nearly as good as the global optimum in terms of cost, but with parameter values orthogonal to those for the global optimum. These are hard problems: you don't generally expect to be able to find the global optimum, instead just looking for a local minimum that is good enough. These are also relevant problems: many interesting problems have just these properties.
I'd suggest first testing your gradient descent implementation on an easy problem. You might try finding the minimum of a one-dimensional polynomial. Since it's a one-parameter problem, you can plot the progress of the parameter values along the curve of the polynomial. You should be able to see if something is drastically wrong, and can also observe how the search gets stuck in local minima. You should also be able to see that the initial parameter choice can matter quite a lot.
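For example, a minimal sketch of such a sanity check (the quartic and the starting points are arbitrary choices for illustration):

    # A quartic with two local minima: f(x) = x^4 - 3x^2 + x
    f  = lambda x: x**4 - 3*x**2 + x
    df = lambda x: 4*x**3 - 6*x + 1          # analytic derivative

    def descend(x0, step=0.01, iters=200):
        """Plain gradient descent on a 1-D function, recording the path."""
        x, path = x0, [x0]
        for _ in range(iters):
            x -= step * df(x)
            path.append(x)
        return path

    # Different starting points land in different minima.
    for x0 in (-2.0, 0.5, 2.0):
        path = descend(x0)
        print(f"start {x0:+.1f} -> x = {path[-1]:+.4f}, f(x) = {f(path[-1]):+.4f}")

Plotting each `path` over the curve of `f` makes the behaviour obvious: starts on the left fall into the deeper minimum near x = -1.3, while starts on the right get stuck in the shallower one near x = 1.1.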
For dealing with harder problems, you might modify your algorithm to help it escape the local minima. A few common approaches:
Add noise. This reduces the precision of the parameters you've found, which can "blur" out local minima. The search can then jump out of local minima that are small compared to the noise, while still being trapped in deeper minima. A well-known approach for adding noise is simulated annealing (first sketch after this list).
Add momentum. Along with using the current gradient to define the step, also continue in the same direction as the previous step. If you take a fraction of the previous step as the momentum term, there is a tendency to keep going, which can carry the search past a local minimum. Because only a fraction is kept, the contribution of each step decays exponentially, so poor steps aren't a big problem (second sketch below). Momentum has long been a popular modification to gradient descent for training neural networks, where gradient descent is known as backpropagation.
Use a hybrid search. First use a global search (e.g., genetic algorithms, various Monte Carlo methods) to find some good starting points, then apply gradient descent from those points to take advantage of the gradient information in the function (third sketch below).
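To illustrate the noise idea, here's a minimal sketch with annealed Gaussian noise on a one-dimensional problem (the decay schedule and parameter values are arbitrary, and this is only the flavour of simulated annealing, not a full implementation):

    import random

    def noisy_descent(df, x0, step=0.01, noise0=0.5, decay=0.995, iters=2000):
        """Gradient descent plus slowly annealed Gaussian noise."""
        x, noise = x0, noise0
        for _ in range(iters):
            x -= step * df(x)                # usual gradient step
            x += random.gauss(0.0, noise)    # noise can kick x out of shallow minima
            noise *= decay                   # cool down: noise shrinks over time
        return x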
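A sketch of the momentum modification (the coefficient 0.9 is a conventional choice, not something prescribed above):

    def momentum_descent(df, x0, step=0.01, beta=0.9, iters=500):
        """Gradient descent with momentum: each step keeps a decaying
        fraction of the previous step, so the search can coast past
        small local minima."""
        x, v = x0, 0.0
        for _ in range(iters):
            v = beta * v - step * df(x)      # fraction of previous step + new gradient step
            x += v
        return x

With beta = 0 this reduces to plain gradient descent; the closer beta is to 1, the further the search can coast past shallow minima, at the cost of possibly overshooting good ones.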
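And a sketch of a simple hybrid, with random multi-start standing in for the global search (a genetic algorithm or Monte Carlo method would replace the uniform sampling in a more serious version):

    import random

    def hybrid_search(f, df, lo, hi, starts=20, step=0.01, iters=500):
        """Sample random starting points globally, refine each with
        gradient descent, and keep the best result."""
        best_x, best_f = None, float("inf")
        for _ in range(starts):
            x = random.uniform(lo, hi)       # crude "global" step
            for _ in range(iters):           # local refinement
                x -= step * df(x)
            if f(x) < best_f:
                best_x, best_f = x, f(x)
        return best_x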
I won't make a recommendation on which to use. Instead, I'll suggest doing a little research to see what others have done with problems related to what you're working on. If it's purely a learning experience, momentum is probably the easiest to get working.