Evaluating Logistic regression with cross validation

Tags:

I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).

These concepts are totally new to me and am not very sure if am doing it right. I would be grateful if anyone could advise me on the right steps to take where I have gone wrong. Part of my code is shown below.

Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?

Thank you

import pandas as pd 
Data=pd.read_csv ('C:\\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]

Y=Data['Status'] 
Y1=Data['Status1']  # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted) 

from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())

from nltk import ConfusionMatrix 
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))

# sensitivity:
print (metrics.recall_score(y, predicted) )

import matplotlib.pyplot as plt 
probs = logreg.predict_proba(X)[:, 1] 
plt.hist(probs) 
plt.show()

# use 0.5 cutoff for predicting 'default' 
import numpy as np 
preds = np.where(probs > 0.5, 1, 0) 
print (ConfusionMatrix(list(y), list(preds)))

# check accuracy, sensitivity, specificity 
print (metrics.accuracy_score(y, predicted)) 

#ROC CURVES and AUC 
# plot ROC curve 
fpr, tpr, thresholds = metrics.roc_curve(y, probs) 
plt.plot(fpr, tpr) 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.0]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate)') 
plt.show()

# calculate AUC 
print (metrics.roc_auc_score(y, probs))

# use AUC as evaluation metric for cross-validation 
from sklearn.cross_validation import cross_val_score 
logreg = LogisticRegression() 
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

414

asked Aug 26 '16 09:08

S.H

1 Answers

You got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset. You just need to remove logreg.fit earlier in the code. Specifically, what it does is the following: It divides your dataset in to n folds and in each iteration it leaves one of the folds out as the test set and trains the model on the rest of the folds (n-1 folds). So, in the end you will get predictions for the entire data.

Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y

In [15]: iris['data'].shape
Out[15]: (150, 4)

To get predictions on the entire set with cross validation you can do the following:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics, cross_validation
from sklearn import datasets
iris = datasets.load_iris()
predicted = cross_validation.cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print metrics.accuracy_score(iris['target'], predicted)

Out [1] : 0.9537

print metrics.classification_report(iris['target'], predicted) 

Out [2] :
                     precision    recall  f1-score   support

                0       1.00      1.00      1.00        50
                1       0.96      0.90      0.93        50
                2       0.91      0.96      0.93        50

      avg / total       0.95      0.95      0.95       150

So, back to your code. All you need is this:

from sklearn import metrics, cross_validation
logreg=LogisticRegression()
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
print metrics.accuracy_score(y, predicted)
print metrics.classification_report(y, predicted)

For plotting ROC in multi-class classification, you can follow this tutorial which gives you something like the following:

In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.

196

answered Oct 17 '22 15:10

CentAu

Related questions
                            
                                Why does open(True, 'w') print the text like sys.stdout.write?
                            
                                How do I get the "visible" length of a combining Unicode string in Python?
                            
                                Running Python script parallel
                            
                                Modify Sphinx TOC tree
                            
                                Calculate weighted average with pandas dataframe
                            
                                How to manipulate figures while a script is running in Python?
                            
                                sklearn: User defined cross validation for time series data
                            
                                Setting group permissions with python
                            
                                Why is the returned value of cv2.HoughLines of OpenCV for Python need to be accessed with index?
                            
                                Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?
                            
                                LDAP authentication with django REST
                            
                                Gensim word2vec on predefined dictionary and word-indices data
                            
                                Create openCV VideoCapture from interface name instead of camera numbers
                            
                                Why can't I break out of this itertools infinite loop?
                            
                                Is it possible to use a module without installing it on your computer?
                            
                                Is slicing really slower in Python 3.4?
                            
                                Sqlalchemy explicit locking of Postgresql table
                            
                                Returning string matches between two lists for a given number of elements in a third list
                            
                                SSL bad Handshake Error 10054 "WSAECONNRESET"
                            
                                how to add a package to sys path for testing

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Evaluating Logistic regression with cross validation

Tags:

python

scikit-learn

logistic-regression

cross-validation

S.H

People also ask

1 Answers

CentAu

Recent Activity

Donate For Us