Gradient Descent: Do we iterate on ALL of the training set with each step in GD? or Do we change GD for each training set?

I've taught myself machine learning with some online resources but I have a question about gradient descent that I couldn't figure out.

The gradient descent update rule for logistic regression is:

Repeat {
    θj := θj − (α/m) ∑ (hθ(x) − y) xj
}

Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is the real value; and xj is the value of variable j. m is the number of training examples, and hθ(x), y, and xj are evaluated on each training example (that's what the summation sign is for).
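To make the roles of α, m, and the summation concrete, here is a minimal numpy sketch of a single update step. The data values are invented purely for illustration:

```python
import numpy as np

# Invented data: 10 training examples, 3 variables (values are illustrative).
X = np.array([[0.5, 1.2, -0.3],
              [1.0, -0.7, 0.8]] * 5)   # shape (10, 3)
y = np.array([1, 0] * 5)               # true labels

theta = np.zeros(3)                    # coefficients initialized to 0
alpha = 0.1                            # learning rate
m = len(y)                             # number of training examples

# Hypothesis h_theta(x) = 1 / (1 + e^(-theta . x)), computed for every example.
h = 1.0 / (1.0 + np.exp(-(X @ theta)))

# One gradient descent step: the summation runs over ALL m examples at once.
theta = theta - (alpha / m) * (X.T @ (h - y))
```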

This is where I get confused.

It's not clear to me whether the summation runs over my entire training set, or only over the examples I have processed up to that point.

For example, imagine I have 10 training examples. If I perform gradient descent after each training example, my coefficients will be very different than if I perform gradient descent only after all 10 training examples.

See below how the First Way differs from the Second Way:

First way

  • Step 1: Since the coefficients are initialized to 0, hθ(x)=0
  • Step 2: Perform gradient descent on the first training example. The summation term includes only 1 training example
  • Step 3: Now use the new coefficients for training examples 1 & 2... the summation term includes the first 2 training examples
  • Step 4: Perform gradient descent again.
  • Step 5: Now use the new coefficients for training examples 1, 2 & 3... the summation term includes the first 3 training examples
  • Continue until convergence or until all training examples have been used.

Second way

  • Step 1: Since the coefficients are initialized to 0, hθ(x)=0 for all 10 training examples
  • Step 2: Perform 1 step of gradient descent using all 10 training examples. The coefficients will be different from the First Way because the summation term includes all 10 training examples
  • Step 3: Use the new coefficients on all 10 training examples again. The summation term still includes all 10 training examples
  • Step 4: Perform gradient descent and continue using the coefficients on all examples until convergence

I hope that explains my confusion. Does anyone know which way is correct?

Edit: Adding cost function and hypothesis function

cost function: J(θ) = −(1/m) ∑ [y log(hθ(x)) + (1 − y) log(1 − hθ(x))]
hypothesis:    hθ(x) = 1 / (1 + e^(−z))
where          z = θ0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn
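Written as a short numpy sketch (the two sample rows at the end are invented for illustration; each row carries a leading 1 so that θ0 acts as the intercept):

```python
import numpy as np

def sigmoid(z):
    """h_theta(x) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)   # z = theta0*1 + theta1*x1 + ... + thetan*xn per row
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# With all-zero coefficients, h = 0.5 for every example, so J = log(2).
X = np.array([[1.0, 2.0], [1.0, -1.0]])   # invented sample rows
y = np.array([1.0, 0.0])
J0 = cost(np.zeros(2), X, y)
```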
asked Jun 24 '13 by Terence Chow


1 Answer

The second way you describe is the correct way to perform gradient descent. The true gradient depends on the whole data set, so one iteration of gradient descent requires using the entire data set. (This is true for any learning algorithm where you can take the gradient.)
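A minimal numpy sketch of this correct "second way" (batch gradient descent); the toy data set and hyperparameters are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Every iteration computes the gradient from the ENTIRE data set."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # sum over all m examples
        theta = theta - alpha * grad
    return theta

# Toy data: a column of 1s for the intercept plus one separating feature.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = batch_gradient_descent(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
```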

The "first way" is close to something called stochastic gradient descent (SGD). The idea is that using the whole data set for one update may be overkill, especially if some of the data points are redundant. Instead, we pick a random point from the data set - essentially setting m = 1 - and update based on successive selections of single points. This way we can do m updates at about the same cost as one update of batch gradient descent, but each update is a bit noisy, which can make convergence to the final solution difficult.
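A sketch of stochastic gradient descent on the same kind of toy data (values and hyperparameters are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, alpha=0.1, epochs=50, seed=0):
    """Each update uses ONE randomly chosen data point (m = 1 in the sum)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):                 # fresh random order each epoch
            grad = (sigmoid(X[i] @ theta) - y[i]) * X[i]
            theta = theta - alpha * grad
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = sgd(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
```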

The compromise between these approaches is called "mini-batch" gradient descent. Taking the gradient of the whole data set is one full round of "batch" processing, since we need the whole data set on hand. Instead we take a mini-batch: a small subset of the whole data set. Here we pick k, with 1 < k < m, where k is the number of points in the mini-batch. At every iteration we select k random data points, compute the gradient from them, and perform the update. Repeat until convergence. Obviously, increasing / decreasing k is a tradeoff between speed and accuracy.
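A mini-batch sketch with k = 2 on the same invented toy data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, y, k=2, alpha=0.1, epochs=200, seed=0):
    """Each update averages the gradient over k randomly chosen points, 1 < k < m."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                   # shuffle, then walk in batches of k
        for start in range(0, m, k):
            idx = order[start:start + k]
            grad = X[idx].T @ (sigmoid(X[idx] @ theta) - y[idx]) / len(idx)
            theta = theta - alpha * grad
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = minibatch_gd(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
```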

Note: for both stochastic and mini-batch gradient descent, it is important to shuffle the data or select the next data point randomly. If you visit the data points in the same fixed order on every pass, you can get really weird / bad results - often diverging away from the solution.
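For example, drawing a fresh random visit order on each pass (the seed is chosen arbitrarily, for reproducibility only):

```python
import numpy as np

rng = np.random.default_rng(42)     # arbitrary seed, for reproducibility only
m = 10                              # number of data points

epoch1_order = rng.permutation(m)   # random visit order for the first pass
epoch2_order = rng.permutation(m)   # a fresh order for the next pass
# Then iterate as: for i in epoch1_order: update theta using (X[i], y[i])
```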

answered Oct 20 '22 by Raff.Edward