I wrote confusion matrix calculation code in Python:
def conf_mat(prob_arr, input_arr):
    # confusion matrix
    conf_arr = [[0, 0], [0, 0]]
    for i in range(len(prob_arr)):
        if int(input_arr[i]) == 1:
            if float(prob_arr[i]) < 0.5:
                conf_arr[0][1] = conf_arr[0][1] + 1
            else:
                conf_arr[0][0] = conf_arr[0][0] + 1
        elif int(input_arr[i]) == 2:
            if float(prob_arr[i]) >= 0.5:
                conf_arr[1][0] = conf_arr[1][0] + 1
            else:
                conf_arr[1][1] = conf_arr[1][1] + 1
    accuracy = float(conf_arr[0][0] + conf_arr[1][1]) / len(input_arr)
    return conf_arr, accuracy
prob_arr is an array of probabilities that my classification code returned; a sample array looks like this:
[1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975, 4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.70943426182688163, 1.0, 1.0, 1.0, 1.0]
input_arr contains the original class labels for the dataset, like this:
[2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1]
What my code is trying to do: given prob_arr and input_arr, for each class (1 and 2) I check whether each instance is misclassified.
But my code only works for two classes. If I run it on multi-class data, it doesn't work. How can I make it work for multiple classes?
For example, for a data set with three classes, it should return: [[21,7,3],[3,38,6],[5,4,19]]
You can derive the confusion matrix by counting the number of instances in each combination of actual and predicted classes, as follows:

import numpy as np

def comp_confmat(actual, predicted):
    # extract the different classes
    classes = np.unique(actual)

    # initialize the confusion matrix
    confmat = np.zeros((len(classes), len(classes)))

    # loop over each combination of actual / predicted classes
    for i in range(len(classes)):
        for j in range(len(classes)):
            # count the instances where the actual class is classes[i]
            # and the predicted class is classes[j]
            confmat[i, j] = np.sum((actual == classes[i]) & (predicted == classes[j]))

    return confmat
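For example, with the inputs given as NumPy arrays (the data here is just illustrative):

actual    = np.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
predicted = np.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
print(comp_confmat(actual, predicted))
# [[3. 0. 0.]
#  [0. 1. 2.]
#  [2. 1. 3.]]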
Scikit-learn provides a confusion_matrix function:
from sklearn.metrics import confusion_matrix

y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
confusion_matrix(y_actu, y_pred)
which outputs a NumPy array:
array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]])
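If you need the rows and columns in a specific order, or want to include classes that never appear in the data, you can pass the labels argument (shown here with a reversed class order):

confusion_matrix(y_actu, y_pred, labels=[2, 1, 0])
# array([[3, 1, 2],
#        [2, 1, 0],
#        [0, 0, 3]])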
But you can also create a confusion matrix using Pandas:
import pandas as pd

y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
You will get a (nicely labeled) Pandas DataFrame:
Predicted  0  1  2
Actual
0          3  0  0
1          0  1  2
2          2  1  3
If you add margins=True, like

df_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

you will also get the sum of each row and column:
Predicted  0  1  2  All
Actual
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
You can also get a normalized confusion matrix using:
df_conf_norm = df_confusion.div(df_confusion.sum(axis=1), axis=0)

Predicted         0         1         2
Actual
0          1.000000  0.000000  0.000000
1          0.000000  0.333333  0.666667
2          0.333333  0.166667  0.500000

(The div(..., axis=0) is what divides each row by its own total; a plain df_confusion / df_confusion.sum(axis=1) would align the divisor with the columns instead of the rows.)
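For comparison, recent versions of scikit-learn (0.22+, if I recall correctly) can produce a normalized matrix directly; a minimal sketch, where normalize='true' normalizes over the actual (row) labels:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_actu, y_pred, normalize='true')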
You can plot this confusion matrix using matplotlib:

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(df_confusion, title='Confusion matrix', cmap=plt.cm.gray_r):
    plt.matshow(df_confusion, cmap=cmap)  # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)

plot_confusion_matrix(df_confusion)
Or plot normalized confusion matrix using:
plot_confusion_matrix(df_conf_norm)
You might also be interested in this project https://github.com/pandas-ml/pandas-ml and its Pip package https://pypi.python.org/pypi/pandas_ml
With this package, a confusion matrix can be pretty-printed and plotted. You can also binarize a confusion matrix, get class statistics such as TP, TN, FP, FN, ACC, TPR, FPR, FNR, TNR (SPC), LR+, LR-, DOR, PPV, FDR, FOR, and NPV, and some overall statistics:
In [1]: from pandas_ml import ConfusionMatrix

In [2]: y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]

In [3]: y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

In [4]: cm = ConfusionMatrix(y_actu, y_pred)

In [5]: cm.print_stats()
Confusion Matrix:

Predicted  0  1  2  __all__
Actual
0          3  0  0        3
1          0  1  2        3
2          2  1  3        6
__all__    5  2  5       12

Overall Statistics:

Accuracy: 0.583333333333
95% CI: (0.27666968568210581, 0.84834777019156982)
No Information Rate: ToDo
P-Value [Acc > NIR]: 0.189264302376
Kappa: 0.354838709677
Mcnemar's Test P-Value: ToDo

Class Statistics:

Classes                                        0          1          2
Population                                    12         12         12
P: Condition positive                          3          3          6
N: Condition negative                          9          9          6
Test outcome positive                          5          2          5
Test outcome negative                          7         10          7
TP: True Positive                              3          1          3
TN: True Negative                              7          8          4
FP: False Positive                             2          1          2
FN: False Negative                             0          2          3
TPR: (Sensitivity, hit rate, recall)           1  0.3333333        0.5
TNR=SPC: (Specificity)                 0.7777778  0.8888889  0.6666667
PPV: Pos Pred Value (Precision)              0.6        0.5        0.6
NPV: Neg Pred Value                            1        0.8  0.5714286
FPR: False-out                         0.2222222  0.1111111  0.3333333
FDR: False Discovery Rate                    0.4        0.5        0.4
FNR: Miss Rate                                 0  0.6666667        0.5
ACC: Accuracy                          0.8333333       0.75  0.5833333
F1 score                                    0.75        0.4  0.5454545
MCC: Matthews correlation coefficient  0.6831301  0.2581989  0.1690309
Informedness                           0.7777778  0.2222222  0.1666667
Markedness                                   0.6        0.3  0.1714286
Prevalence                                  0.25       0.25        0.5
LR+: Positive likelihood ratio               4.5          3        1.5
LR-: Negative likelihood ratio                 0       0.75       0.75
DOR: Diagnostic odds ratio                   inf          4          2
FOR: False omission rate                       0        0.2  0.4285714
I noticed that a new Python library for confusion matrices, named PyCM, is out: maybe you can have a look.
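I haven't used it extensively, but a minimal usage sketch based on its documented API would look like this (actual_vector and predict_vector are PyCM's parameter names):

from pycm import ConfusionMatrix

y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

cm = ConfusionMatrix(actual_vector=y_actu, predict_vector=y_pred)
print(cm)  # pretty-prints the matrix along with per-class and overall statistics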
Nearly a decade has passed, yet the solutions (without sklearn) in this thread are convoluted and unnecessarily long. Computing a confusion matrix can be done cleanly in Python in a few lines. For example:
import numpy as np

def compute_confusion_matrix(true, pred):
    '''Computes a confusion matrix using numpy for two np.arrays
    true and pred.

    Results are identical (and similar in computation time) to:
    "from sklearn.metrics import confusion_matrix"

    However, this function avoids the dependency on sklearn.'''

    K = len(np.unique(true))  # Number of classes
    result = np.zeros((K, K))

    for i in range(len(true)):
        result[true[i]][pred[i]] += 1

    return result
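A quick check on the same data as the scikit-learn example above; note that the result[true[i]][pred[i]] indexing assumes the labels are the integers 0..K-1:

true = np.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
pred = np.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
print(compute_confusion_matrix(true, pred))
# [[3. 0. 0.]
#  [0. 1. 2.]
#  [2. 1. 3.]]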