I'm using log_loss from sklearn:
from sklearn.metrics import log_loss
print(log_loss(true, pred, normalize=False))
and I get the following error:
ValueError: y_true and y_pred have different number of classes 38, 2
This is really strange to me, since the arrays look valid:
print(pred.shape)
print(np.unique(pred))
print(np.unique(pred).size)
(19191L,)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37]
38
print(true.shape)
print(np.unique(true))
print(np.unique(true).size)
(19191L,)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37]
38
What is wrong with log_loss? Why does it throw this error?
Sample data:
pred: array([ 0, 1, 2, ..., 3, 12, 16], dtype=int64)
true: array([ 0, 1, 2, ..., 3, 12, 16])
Log loss indicates how close the predicted probability is to the corresponding actual/true value (0 or 1 in the case of binary classification). The more the predicted probability diverges from the actual value, the higher the log loss.
For a single sample, the log loss is simply L(p) = −log(p), where p is the probability the model assigned to the true class. So L(p) = 0 is the best case, meaning we assigned probability 1 to the right class, while L(p) → +∞ is the worst, because we assigned probability 0 to the actual class.
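To make this concrete, here is a tiny sketch of the per-sample loss in plain NumPy (the probability values are made up for illustration):

import numpy as np

# Probability the model assigned to the true class:
p_confident = 0.9   # close to 1 -> small loss
p_wrong = 0.01      # close to 0 -> large loss

print(-np.log(p_confident))  # ~0.105
print(-np.log(p_wrong))      # ~4.605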
Log loss, aka logistic loss or cross-entropy loss, is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks. It is defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true.
The log loss can be implemented in Python using the log_loss() function in scikit-learn. In the binary classification case, the function takes a list of true outcome values and a list of probabilities as arguments and calculates the average log loss for the predictions.
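As a minimal example of the binary case (the labels and probabilities below are invented for illustration):

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.7, 0.99]  # predicted P(class == 1) for each sample

print(log_loss(y_true, y_prob))  # average log loss, roughly 0.17 here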
It's simple: you are passing the predicted labels, not the predicted probabilities. Your pred variable contains

[1, 2, 1, 3, ...]  # classes: 1, 2, or 3

but to use log_loss it should contain something like:

# each element is an array with the probability of each class
[[0.1, 0.8, 0.1],
 [0.0, 0.79, 0.21],
 ...]
To obtain these probabilities, use predict_proba:

pred = model.predict_proba(x_test)
loss = log_loss(y_true, pred)
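If you want a self-contained sketch to try, here is one on a toy multiclass dataset (the dataset and variable names are chosen just for demonstration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_true = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# Shape (n_samples, n_classes): one probability per class, each row sums to 1.
pred = model.predict_proba(x_test)
print(log_loss(y_true, pred))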
Inside log_loss, the true array is fit and transformed by a LabelBinarizer, which changes its dimensions: a 1-D array of class labels becomes an (n_samples, n_classes) indicator matrix. So the fact that true and pred have the same shape going in does not mean log_loss will work, because true's dimensions change internally. Passing hard labels as pred only happens to be accepted in the binary case, where a 1-D vector can be read as positive-class probabilities; with multiple classes this approach doesn't work.
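You can see that dimension change by applying LabelBinarizer directly, which is roughly what happens to the true array inside log_loss (labels made up for illustration):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

true = np.array([0, 1, 2, 1])                  # shape (4,)
one_hot = LabelBinarizer().fit_transform(true)
print(one_hot.shape)                           # (4, 3): one column per class
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]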
From the log_loss documentation:
y_pred : array-like of float, shape = (n_samples, n_classes) or (n_samples,)
Predicted probabilities, as returned by a classifier’s predict_proba method. If y_pred.shape = (n_samples,) the probabilities provided are assumed to be that of the positive class. The labels in y_pred are assumed to be ordered alphabetically, as done by preprocessing.LabelBinarizer.
You need to pass probabilities, not the predicted labels.
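Tying this back to the error in the question, here is a sketch contrasting the failing call with the correct one (arrays invented for illustration):

import numpy as np
from sklearn.metrics import log_loss

true = np.array([0, 1, 2])
pred_labels = np.array([0, 1, 2])         # hard labels, like the question's pred
pred_proba = np.array([[0.8, 0.1, 0.1],   # one probability per class
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])

# log_loss(true, pred_labels)   # raises ValueError: labels are not probabilities
print(log_loss(true, pred_proba, normalize=False))  # summed log loss, as in the question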