Vectorized SVM gradient

Tags: python, numpy, svm

I was going through the code for the SVM loss and derivative. I understand the loss, but I cannot understand how the gradient is computed in a vectorized manner:

import numpy as np

def svm_loss_vectorized(W, X, y, reg):

    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]

    # scores: one row per sample, one column per class
    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]), y]  # score of the correct class for each sample
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train), y] = 0  # do not count the correct class itself
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)

I understand everything up to here. After this point I cannot understand why we sum the binary matrix row-wise and then subtract that row sum at the correct-class positions:

    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum.T
    dW = np.dot(X.T, binary)

    # Average
    dW /= num_train

    # Regularize
    dW += reg * W

    return loss, dW
asked Dec 02 '17 by Nikhil Verma



1 Answer

Let us recap the scenario and the loss function first, so we are on the same page:

Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.

The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed. This is the same as matrix-multiplying X and W, yielding a score matrix Y0 whose rows are ordered like the elements of Y, so each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
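
To make the setup concrete, here is a minimal sketch (with made-up sizes P, N, M and random data, purely for illustration) of how the scores and predicted categories are obtained:

import numpy as np

P, N, M = 5, 3, 4                    # samples, features, categories (made-up sizes)
X = np.random.randn(P, N)            # one sample per row
W = np.random.randn(N, M)            # one linear classifier per column

scores = X.dot(W)                    # P x M score matrix (Y0 in the text)
y_pred = np.argmax(scores, axis=1)   # predicted category = column with the largest score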

There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.

Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions, and that lets us do gradient descent. One of the most straightforward ways is to punish, for each sample i, every score that is larger than the score of the correct category for that sample, and to let the penalty grow linearly with the difference. (The question's code also adds a safety margin of 1 inside the max, so a category is penalized as soon as it comes within 1 of the correct score; this constant offset changes which categories get penalized but not the form of the gradient.) So if we write A[i] for the set of categories j that score more than the correct category, Y0[i, j] > Y0[i, Y[i]], the loss for sample i could be written as

sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])

or equivalently if we write #A[i] for the number of elements in A[i]

(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]
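
Spelled out as a plain Python loop (just a sketch of the data term for illustration; the +1 safety margin from the question's code is included, and the helper name data_loss_looped is only an illustrative choice), this loss reads:

import numpy as np

def data_loss_looped(W, X, y):
    # Loop-based version of the data term of the loss, for illustration only.
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    for i in range(num_train):
        scores_i = X[i].dot(W)              # scores of sample i for all categories
        correct = scores_i[y[i]]            # score of the correct category
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores_i[j] - correct + 1
            if margin > 0:                  # j is in the set A[i]
                loss += margin
    return loss / num_train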

The partial derivatives with respect to the score are thus simply

                    | -#A[i]      if j == Y[i]
dloss / dY0[i, j] = {      1      if j in A[i]
                    |      0      else

which is precisely what the first four lines you say you don't understand compute.
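
In loop form (again just a sketch to show what those four lines produce; the function name is an illustrative choice), the matrix dloss/dY0 -- the binary array in the question -- is built entry by entry like this:

import numpy as np

def dloss_dscores_looped(scores, y):
    # Build dloss/dY0 following the piecewise formula above.
    num_train, num_classes = scores.shape
    dscores = np.zeros_like(scores, dtype=float)
    for i in range(num_train):
        violating = 0                                    # this becomes #A[i]
        for j in range(num_classes):
            if j == y[i]:
                continue
            if scores[i, j] - scores[i, y[i]] + 1 > 0:   # j is in A[i]
                dscores[i, j] = 1
                violating += 1
        dscores[i, y[i]] = -violating                    # -#A[i] at the correct category
    return dscores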

The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW. Since Y0 = X W is linear in W, this contraction works out to dW = X.T (dloss/dY0), which is exactly the line dW = np.dot(X.T, binary).

It remains to divide by the number of samples to get a per-sample loss, and to add the derivative of the regularization term, which is easy because the regularization is just a componentwise quadratic function: the derivative of 0.5 * reg * np.sum(W * W) is reg * W.
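
Putting everything together, a loop-based reference implementation (a sketch, assuming the same call signature as svm_loss_vectorized; the name svm_loss_looped is just an illustrative choice, and it is mainly useful to check the vectorized gradient numerically) could look like this:

import numpy as np

def svm_loss_looped(W, X, y, reg):
    # Loop-based loss and gradient, for comparison with the vectorized version.
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(num_train):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + 1
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]        # contributes +1 * x_i for the violating category
                dW[:, y[i]] -= X[i]     # accumulates -#A[i] * x_i at the correct category
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    return loss, dW

# Sanity check on random data:
# X = np.random.randn(50, 10); y = np.random.randint(0, 5, 50); W = np.random.randn(10, 5)
# print(np.allclose(svm_loss_looped(W, X, y, 0.1)[1], svm_loss_vectorized(W, X, y, 0.1)[1]))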

answered Oct 11 '22 by Paul Panzer