I calculated a confusion matrix for my classifier using confusion_matrix()
from scikit-learn. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1, and read the percentage of correctly classified samples off the matrix.
I found several methods for normalizing a matrix (row and column normalization), but I don't know much about maths and am not sure which of these is the correct approach.
"Normalized" here means that each class (row) is rescaled so that it sums to 1.00: each row of a normalized confusion matrix represents 100% of the samples in a particular topic, cluster, or class, and each entry is the fraction of that class's samples that received a given predicted label.
The best accuracy is 1.0 and the worst is 0.0; it can also be computed as 1 - ERR (the error rate). Accuracy is the total number of correct predictions (TP + TN) divided by the total size of the dataset (P + N).
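As a small sketch of both quantities (assuming NumPy and a hypothetical 3x3 confusion matrix C, with rows as true classes and columns as predicted classes):

import numpy as np

# hypothetical confusion matrix: rows = true class, columns = predicted class
C = np.array([[1, 1, 1],
              [1, 2, 0],
              [0, 0, 1]])

# row-normalize: every row then sums to 1.0, i.e. 100% of that class's samples
C_norm = C / C.sum(axis=1, keepdims=True)
print(C_norm.sum(axis=1))        # [1. 1. 1.]

# accuracy: correct predictions (the diagonal) over all samples, i.e. 1 - ERR
accuracy = np.trace(C) / C.sum()
print(accuracy)                  # ~0.571 for this example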
Suppose that
>>> import numpy as np
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
       [1, 2, 0],
       [0, 0, 1]])
Then, to find out how many samples per class have received their correct label, you need
>>> C / C.sum(axis=1, keepdims=True)
array([[0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.66666667, 0.        ],
       [0.        , 0.        , 1.        ]])
The diagonal contains the required values. Another way to compute these is to realize that what you're computing is the recall per class:
>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([0.33333333, 0.66666667, 1.        ])
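As a quick check (a sketch continuing the session above, so C and recall are assumed to be defined already), the diagonal of the row-normalized matrix is exactly this recall array:

>>> np.allclose(np.diag(C / C.sum(axis=1, keepdims=True)), recall)
True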
Similarly, if you divide by the sum over axis=0, you get the precision (the fraction of class-k predictions that have ground truth label k):
>>> C / C.sum(axis=0)
array([[0.5       , 0.33333333, 0.5       ],
       [0.5       , 0.66666667, 0.        ],
       [0.        , 0.        , 0.5       ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([0.5       , 0.66666667, 0.5       ])
From the scikit-learn documentation (the confusion matrix plot example):
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
where cm is the confusion matrix as provided by sklearn.
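Note that if your scikit-learn version is 0.22 or newer, confusion_matrix can also do the normalization for you via its normalize argument; a minimal sketch, reusing y_true and y_pred from above:

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_true, y_pred, normalize='true')
array([[0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.66666667, 0.        ],
       [0.        , 0.        , 1.        ]])

normalize='true' makes each row sum to 1 (recall on the diagonal); 'pred' normalizes the columns (precision on the diagonal), and 'all' normalizes by the total number of samples.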