My problem:
I have a dataset which is a large JSON file. I read it and store it in the trainList
variable.
Next, I pre-process it in order to be able to work with it.
Once I have done that, I start the classification: I use the kfold cross-validation method to obtain the mean accuracy and train a classifier. After making the predictions, I want to calculate the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values. I'll use these parameters to obtain the Sensitivity and Specificity. Finally, I would use them in HTML in order to show a chart with the TPs of each label.
Code:
The variables I have for the moment:
trainList #It is a list with all the data of my dataset in JSON form
labelList #It is a list with all the labels of my data
The main part of the method:
from sklearn.model_selection import KFold
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, confusion_matrix

# vec and qda are my vectorizer and classifier, defined earlier

# I transform the data from JSON form to a numerical one
X = vec.fit_transform(trainList)

# I scale the matrix (without it I get an error, not sure why)
X = preprocessing.scale(X.toarray())

# I generate a KFold in order to make cross validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# I start the cross validation
for train_indices, test_indices in kf.split(X):
    X_train = [X[ii] for ii in train_indices]
    X_test = [X[ii] for ii in test_indices]
    y_train = [labelList[ii] for ii in train_indices]
    y_test = [labelList[ii] for ii in test_indices]

    # I train the classifier
    trained = qda.fit(X_train, y_train)

    # I make the predictions
    predicted = qda.predict(X_test)

    # I obtain the accuracy of this fold
    ac = accuracy_score(y_test, predicted)

    # I obtain the confusion matrix
    cm = confusion_matrix(y_test, predicted)

    # I should calculate the TP, TN, FP and FN
    # I don't know how to continue
The false negative rate (FNR) is calculated as FN/(FN+TP), where FN is the number of false negatives and TP is the number of true positives (FN+TP being the total number of actual positives). The true positive rate (TPR, also called sensitivity) is calculated as TP/(TP+FN); it is the probability that an actual positive will test positive.
A true positive is an outcome where the model correctly predicts the positive class, and a true negative is an outcome where the model correctly predicts the negative class. Likewise, a false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class.
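For example, with hypothetical counts of TP = 90 and FN = 10 for some class, TPR = 90/(90+10) = 0.9 and FNR = 10/(10+90) = 0.1; the two rates always sum to 1.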
For the multi-class case, everything you need can be found from the confusion matrix: with rows for actual classes and columns for predicted classes (as sklearn's confusion_matrix returns it), each class's TP count is its diagonal entry, its FNs are the rest of its row, its FPs are the rest of its column, and its TNs are everything else.
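For illustration, take this hypothetical 3x3 matrix (rows are the actual classes, columns the predicted ones):
cm = [[3, 0, 0],
      [1, 1, 1],
      [0, 0, 3]]
For the first class, TP = cm[0][0] = 3; FN = 0 + 0 = 0 (the rest of row 0); FP = 1 + 0 = 1 (the rest of column 0); and TN = 9 - (3 + 0 + 1) = 5 (all remaining cells).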
Using numpy, you can do this for all classes at once like so (cm is the confusion matrix as a numpy array, e.g. as returned by sklearn's confusion_matrix):
import numpy as np

FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)
# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP)
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)
# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
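Putting it together, here is a self-contained usage sketch; the labels and predictions below are made up purely for illustration:
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical ground truth and predictions for a 3-class problem
y_true = [0, 1, 2, 2, 0, 1, 1, 2, 0]
y_pred = [0, 2, 2, 2, 0, 1, 0, 2, 0]

cm = confusion_matrix(y_true, y_pred)

FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)

print(TP / (TP + FN))  # per-class sensitivity: [1.0, 0.333..., 1.0]
print(TN / (TN + FP))  # per-class specificity: [0.833..., 1.0, 0.833...]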
If you have two lists with the predicted and actual values, as it appears you do, you can pass them to a function that will calculate TP, FP, TN and FN, like this:
def perf_measure(y_actual, y_hat):
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    for i in range(len(y_hat)):
        if y_actual[i] == y_hat[i] == 1:
            TP += 1
        if y_hat[i] == 1 and y_actual[i] != y_hat[i]:
            FP += 1
        if y_actual[i] == y_hat[i] == 0:
            TN += 1
        if y_hat[i] == 0 and y_actual[i] != y_hat[i]:
            FN += 1

    return (TP, FP, TN, FN)
From here I think you will be able to calculate the rates of interest to you, and other performance measures like specificity and sensitivity.
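For example, with made-up binary labels:
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat    = [1, 1, 1, 0, 0, 0, 1, 0]
TP, FP, TN, FN = perf_measure(y_actual, y_hat)
# TP=3, FP=1, TN=3, FN=1, so sensitivity = 3/(3+1) = 0.75 and specificity = 3/(3+1) = 0.75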