Vectorized SVM gradient

Tags: python, numpy, svm

I was going through the code for the SVM loss and derivative. I understand the loss, but I cannot understand how the gradient is computed in a vectorized manner:

import numpy as np

def svm_loss_vectorized(W, X, y, reg):

    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]

    # scores: one row per sample, one column per class
    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]), y]  # score of the correct class for each sample
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train), y] = 0  # do not count the correct class itself
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)

I understand everything up to here. After this point I cannot understand why we sum the binary matrix row-wise and then subtract that row sum at the correct-class positions:

    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum.T
    dW = np.dot(X.T, binary)

    # Average
    dW /= num_train

    # Regularize
    dW += reg * W

    return loss, dW
asked Dec 02 '17 by Nikhil Verma



1 Answer

Let us recap the scenario and the loss function first, so we are on the same page:

Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.

The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed. This is the same as matrix-multiplying X and W, yielding a score matrix Y0 whose rows are ordered like the elements of Y, so each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
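
To make the setup concrete, here is a minimal sketch (with made-up sizes P, N, M and random data, purely for illustration) of how the scores and predicted categories are obtained:

import numpy as np

P, N, M = 5, 3, 4                    # samples, features, categories (made-up sizes)
X = np.random.randn(P, N)            # one sample per row
W = np.random.randn(N, M)            # one linear classifier per column

scores = X.dot(W)                    # P x M score matrix (Y0 in the text)
y_pred = np.argmax(scores, axis=1)   # predicted category = column with the largest score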

There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.

Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions, and that lets us do gradient descent. One of the most straightforward ways is to punish, for each sample i, every score that is larger than the score of the correct category for that sample, and to let the penalty grow linearly with the difference. (The question's code also adds a safety margin of 1 inside the max, so a category is penalized as soon as it comes within 1 of the correct score; this constant offset changes which categories get penalized but not the form of the gradient.) So if we write A[i] for the set of categories j that score more than the correct category, Y0[i, j] > Y0[i, Y[i]], the loss for sample i could be written as

sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])

or equivalently if we write #A[i] for the number of elements in A[i]

(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]
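
Spelled out as a plain Python loop (just a sketch of the data term for illustration; the +1 safety margin from the question's code is included, and the helper name data_loss_looped is only an illustrative choice), this loss reads:

import numpy as np

def data_loss_looped(W, X, y):
    # Loop-based version of the data term of the loss, for illustration only.
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    for i in range(num_train):
        scores_i = X[i].dot(W)              # scores of sample i for all categories
        correct = scores_i[y[i]]            # score of the correct category
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores_i[j] - correct + 1
            if margin > 0:                  # j is in the set A[i]
                loss += margin
    return loss / num_train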

The partial derivatives with respect to the score are thus simply

                    | -#A[i]      if j == Y[i]
dloss / dY0[i, j] = {      1      if j in A[i]
                    |      0      else

which is precisely what the first four lines you say you don't understand compute.
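
In loop form (again just a sketch to show what those four lines produce; the function name is an illustrative choice), the matrix dloss/dY0 -- the binary array in the question -- is built entry by entry like this:

import numpy as np

def dloss_dscores_looped(scores, y):
    # Build dloss/dY0 following the piecewise formula above.
    num_train, num_classes = scores.shape
    dscores = np.zeros_like(scores, dtype=float)
    for i in range(num_train):
        violating = 0                                    # this becomes #A[i]
        for j in range(num_classes):
            if j == y[i]:
                continue
            if scores[i, j] - scores[i, y[i]] + 1 > 0:   # j is in A[i]
                dscores[i, j] = 1
                violating += 1
        dscores[i, y[i]] = -violating                    # -#A[i] at the correct category
    return dscores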

The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW. Since Y0 = X W is linear in W, this contraction works out to dW = X.T (dloss/dY0), which is exactly the line dW = np.dot(X.T, binary).

It remains to divide by the number of samples to get a per-sample loss, and to add the derivative of the regularization term, which is easy because the regularization is just a componentwise quadratic function: the derivative of 0.5 * reg * np.sum(W * W) is reg * W.
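
Putting everything together, a loop-based reference implementation (a sketch, assuming the same call signature as svm_loss_vectorized; the name svm_loss_looped is just an illustrative choice, and it is mainly useful to check the vectorized gradient numerically) could look like this:

import numpy as np

def svm_loss_looped(W, X, y, reg):
    # Loop-based loss and gradient, for comparison with the vectorized version.
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(num_train):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + 1
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]        # contributes +1 * x_i for the violating category
                dW[:, y[i]] -= X[i]     # accumulates -#A[i] * x_i at the correct category
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    return loss, dW

# Sanity check on random data:
# X = np.random.randn(50, 10); y = np.random.randint(0, 5, 50); W = np.random.randn(10, 5)
# print(np.allclose(svm_loss_looped(W, X, y, 0.1)[1], svm_loss_vectorized(W, X, y, 0.1)[1]))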

answered Oct 11 '22 by Paul Panzer