 

Accuracy of model is 0.86 while AUC is 0.50?

I ran a few models in sklearn. Here is the code.

# Function for Stochastic Gradient Descent Logistic Regression with Elastic Net
def SGDlogistic(k_fold, train_X, train_Y):
    """Logistic regression (log loss) with an elastic net penalty,
    fitted by stochastic gradient descent and scored by cross-validation.
    """

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    scores_sgd_lr = []

    for train_indices, test_indices in k_fold:
        train_X_cv = train_X[train_indices]
        train_Y_cv = train_Y[train_indices]

        test_X_cv = train_X[test_indices]
        test_Y_cv = train_Y[test_indices]

        sgd_lr = SGDClassifier(loss='log', penalty='elasticnet')
        scores_sgd_lr.append(sgd_lr.fit(train_X_cv, train_Y_cv).score(test_X_cv, test_Y_cv))

    print("The mean accuracy of Stochastic Gradient Descent Logistic on CV data is:", np.mean(scores_sgd_lr))

    # Return the model fitted on the last fold
    return sgd_lr



def test_performance(test_X, test_Y, classifier, name):
    """Check the performance of a fitted classifier on test data."""

    from sklearn import metrics

    predictions = classifier.predict(test_X)

    print("The accuracy of " + name + " on test data is:", classifier.score(test_X, test_Y))
    print("Classification metrics for " + name)
    print(metrics.classification_report(test_Y, predictions))
    print("Confusion matrix")
    print(metrics.confusion_matrix(test_Y, predictions))




def plot_ROC(test_X, test_Y, classifier):
    """Plot the ROC curve of the classifier on test data."""

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    false_positive_rate, true_positive_rate, thresholds = roc_curve(test_Y, classifier.predict(test_X))
    roc_auc = auc(false_positive_rate, true_positive_rate)

    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

The first function does logistic regression with an elastic net penalty. The second function tests the performance of the algorithm on the test data; it prints the accuracy, a classification report, and the confusion matrix. plot_ROC plots the ROC curve on the test data.
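
For context, a hypothetical driver for these three functions could look like the sketch below. The synthetic make_classification data and the KFold setup are assumptions, not the post's actual dataset; note that the loop in SGDlogistic consumes a bare iterable of (train, test) index pairs, which KFold(...).split(...) provides in modern sklearn (pre-0.18 KFold objects were iterable directly).

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

# Synthetic, imbalanced stand-in for the real dataset (90% class 0)
X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, random_state=0)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0).split(train_X)
sgd_lr = SGDlogistic(k_fold, train_X, train_Y)
test_performance(test_X, test_Y, sgd_lr, "Logistic with Elastic Net")
plot_ROC(test_X, test_Y, sgd_lr)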

Here is what I see.

The accuracy of Logistic with Elastic Net on test data is: 0.90566607467092586
Classification metrics for Logistic with Elastic Net
             precision    recall  f1-score   support

          0       0.91      1.00      0.95    227948
          1       0.50      0.00      0.00     23743

avg / total       0.87      0.91      0.86    251691

Confusion matrix
[[227944      4]
 [ 23739      4]]

[Plot: ROC curve on test data, AUC = 0.50]

(array([ 0.        ,  0.00001755,  1.        ]),
 array([ 0.        ,  0.00016847,  1.        ]),
 array([2, 1, 0]))
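
Those three arrays are the false positive rates, true positive rates, and thresholds returned by roc_curve; with hard 0/1 predictions there are only two distinct scores, so the curve has just three points. A minimal sketch with made-up labels reproduces the same shape:

import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels: hard 0/1 predictions yield only three ROC points
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 89 + [1] + [0] * 9 + [1])   # almost always predicts 0

fpr, tpr, thresholds = roc_curve(y_true, y_pred)
print(fpr, tpr, thresholds)   # (0, 0), one interior point, (1, 1)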

If you look, the accuracy on the test data is 90%, and even the confusion matrix shows good precision and recall, so it's not just the accuracy that might be misleading. But the ROC gives an AUC of about 0.50? That's so weird. According to the ROC the model behaves like a random guess, while the accuracy and confusion matrix show a different picture.

Help, please.

Edit 2:

OK, so I added code to use predicted probabilities instead of the hard classifications for the AUC.

This is what I get now.

[Plot: ROC curve using predicted probabilities, AUC = 0.71]

As you see, the AUC is now 0.71. I haven't done anything about the class imbalance. One question: how do I convert prediction scores to probabilities for SVM and similar models? SGDClassifier currently has predict_proba only for the log loss and modified Huber loss functions. Does that mean I can't go beyond logistic regression to get an AUC?
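
For what it's worth, ROC/AUC does not actually require probabilities, only a ranking, so roc_curve and roc_auc_score also accept the raw scores from decision_function. Here is a minimal sketch for a hinge-loss (linear SVM style) SGDClassifier, on assumed synthetic data rather than the post's pipeline:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SGDClassifier(loss='hinge', penalty='elasticnet').fit(X_tr, y_tr)
scores = svm.decision_function(X_te)   # signed margins, not probabilities
print(roc_auc_score(y_te, scores))     # AUC needs only a score ranking

If calibrated probabilities are genuinely needed, sklearn.calibration.CalibratedClassifierCV can wrap such a model, at the cost of an extra cross-validated fit.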

asked Nov 20 '15 by Baktaawar


1 Answer

Your results seem to indicate the classifier is predicting 0 in almost all cases.

Below is an example where the data is 90% in class 0 and the classifier always predicts 0. It looks very similar to your results.

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_true = [0] * 90 + [1] * 10   # 90% class 0, 10% class 1
y_pred = [0] * 90 + [0] * 10   # all predictions are class 0

print(classification_report(y_true, y_pred))

#              precision    recall  f1-score   support
#
#           0       0.90      1.00      0.95        90
#           1       0.00      0.00      0.00        10
#
# avg / total       0.81      0.90      0.85       100

print(confusion_matrix(y_true, y_pred))

# [[90  0]
#  [10  0]]

print(roc_auc_score(y_true, y_pred))

# 0.5

Also, for measuring AUC you should be predicting probabilities using predict_proba instead of predicting labels.

probs = classifier.predict_proba(test_X)[:, 1]   # probability of the positive class
false_positive_rate, true_positive_rate, thresholds = roc_curve(test_Y, probs)
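
From there the AUC can be recomputed on the probability scores, e.g.:

print(roc_auc_score(test_Y, probs))   # AUC from probabilities instead of hard labels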
answered Oct 10 '22 by David Maust