I am trying to understand a simple implementation of a Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. In the example on that page, there are 300 random points in a 2D space, each with a class label, and the softmax classifier learns which point belongs to which class.
Here is the full code of the softmax classifier (you can also see it at the link I provided).
# assumed setup (created by the data-generation code at the link):
# X is the [N x D] data matrix, y the [N] vector of integer labels,
# D the input dimension, K the number of classes
import numpy as np

# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))

# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength

# gradient descent loop
num_examples = X.shape[0]
for i in range(200):

    # evaluate class scores, [N x K]
    scores = np.dot(X, W) + b

    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_examples),y])
    data_loss = np.sum(correct_logprobs)/num_examples
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss
    if i % 10 == 0:
        print("iteration %d: loss %f" % (i, loss))

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples),y] -= 1
    dscores /= num_examples

    # backpropagate the gradient to the parameters (W,b)
    dW = np.dot(X.T, dscores)
    db = np.sum(dscores, axis=0, keepdims=True)
    dW += reg*W # regularization gradient

    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
I can't understand how they computed the gradient here. I assume the gradient is computed in these lines -
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
But how? I mean, why is the gradient dW equal to np.dot(X.T, dscores), and why is the gradient db equal to np.sum(dscores, axis=0, keepdims=True)? How did they compute the gradients on the weights and the bias? And why did they add the regularization gradient?
I am just starting to learn about convolutional neural networks and deep learning, and I heard that CS231n - Convolutional Neural Networks for Visual Recognition is a good starting place for that. I did not know where to post deep-learning-related questions, so I posted this on Stack Overflow. If there is a better place to ask questions about deep learning, please let me know.
The Softmax classifier uses the cross-entropy loss. It gets its name from the softmax function, which squashes the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. The softmax function turns a vector of K real values (which can be positive, negative, zero, or greater than one) into a vector of K values between 0 and 1 that sum to 1, so they can be interpreted as probabilities; each probability is proportional to the relative scale of its corresponding score.
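To see that squashing concretely, here is a small standalone sketch (mine, not part of the linked code) that applies softmax to one made-up vector of raw scores:
import numpy as np

scores = np.array([2.0, -1.0, 0.5])      # raw class scores for one example
exp_scores = np.exp(scores)              # exponentiate: every value becomes positive
probs = exp_scores / np.sum(exp_scores)  # normalize so the values sum to 1
print(probs)        # roughly [0.786 0.039 0.175]
print(probs.sum())  # 1.0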
The gradients start being computed here:
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
First, this sets dscores equal to the probabilities computed by the softmax function. Then, the second line subtracts 1 from the probabilities of the correct classes, and the third line divides by the number of training examples.
Why subtract 1? Because, ideally, you want the probability of the correct label to be 1. So the code subtracts what it should predict from what it actually predicts: if the prediction for the correct class is close to 1, the result is a small negative number (close to zero), so the gradient is small, because you're close to a solution. Otherwise, the result is a large negative number (far from zero, close to -1), so the gradient is bigger, and you'll take larger steps towards the solution.
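To make that concrete, here is a tiny made-up example (not from the linked code) with one training example, 3 classes, and correct class 0:
import numpy as np

probs = np.array([[0.7, 0.2, 0.1]])   # softmax output for a single example
y = np.array([0])                     # the correct class is class 0
num_examples = 1

dscores = probs.copy()                # copy so we don't overwrite probs
dscores[range(num_examples), y] -= 1  # subtract 1 only at the correct class
dscores /= num_examples
print(dscores)                        # [[-0.3  0.2  0.1]]
If the classifier had been more confident, say probs = [[0.95, 0.03, 0.02]], the entry for the correct class would only be -0.05, so the resulting update would barely change anything.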
The score function is simply w*x + b. Its derivative with respect to w is x, which is why dW is the dot product between X (transposed) and the gradient of the scores / output layer.
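One way to see this (my own sketch, using made-up tiny sizes) is that np.dot(X.T, dscores) just accumulates, for every training example, its input vector times its score gradient, which is exactly the chain rule applied to scores = np.dot(X, W) + b:
import numpy as np

N, D, K = 5, 2, 3                     # tiny made-up sizes
X = np.random.randn(N, D)
dscores = np.random.randn(N, K)       # pretend these are the score gradients

# gradient as written in the course code
dW_vectorized = np.dot(X.T, dscores)  # [D x K]

# the same thing, one example at a time: each example contributes
# outer(x_i, dscores_i), because d(scores_i)/dW is x_i for every class column
dW_loop = np.zeros((D, K))
for i in range(N):
    dW_loop += np.outer(X[i], dscores[i])

print(np.allclose(dW_vectorized, dW_loop))  # True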
The derivative of w*x + b with respect to b is 1, so each example passes its score gradient straight through to the bias, which is why you simply sum dscores over the examples when backpropagating.
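As for the regularization gradient: dW += reg*W is just the derivative of the 0.5*reg*np.sum(W*W) term in the loss, since the derivative of 0.5*reg*W**2 with respect to W is reg*W. Here is a quick check of both points (my own sketch, again with made-up tiny sizes):
import numpy as np

N, D, K = 5, 2, 3
X = np.random.randn(N, D)
dscores = np.random.randn(N, K)
W = 0.01 * np.random.randn(D, K)
reg = 1e-3

# bias gradient as written in the course code: sum the score gradients over examples
db_vectorized = np.sum(dscores, axis=0, keepdims=True)  # [1 x K]

# the same thing one example at a time: d(scores_i)/db = 1 for every example
db_loop = np.zeros((1, K))
for i in range(N):
    db_loop += dscores[i]
print(np.allclose(db_vectorized, db_loop))  # True

# regularization gradient: d/dW of 0.5*reg*np.sum(W*W) is reg*W,
# checked with a finite difference on a single entry of W
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numeric = (0.5*reg*np.sum(W_plus*W_plus) - 0.5*reg*np.sum(W_minus*W_minus)) / (2*h)
print(np.isclose(numeric, reg * W[0, 0]))  # True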