
Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run a random forest on the dataset, using cross validation. I do this in two ways:

  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I am getting different results when I do what I believe is essentially the same thing. To illustrate, I run a two-fold cross validation using the two methods above, as in the code below.

import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data: columns 0-18 are the features, column 19 is the 0/1 label
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:19]
y = data.iloc[:, 19]

depth = 5
maxFeat = 3 

result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

print result
# result is now something like array([ 0.66773295,  0.58824739])

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                          max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:, 1])   # probability of the positive class
print auc    # something like 0.83

# swap the halves: train on the second half, score on the first
RFModel.fit(xtest, ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1])
print auc    # also something like 0.83

My question is:

why am I getting different results? That is, why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say 10), there appears to be a pattern in my results: the first fold always gives me the highest AUC.

In the two-fold cross validation example above, the first AUC is always higher than the second; it's always something like 0.70 versus 0.58.

Thanks for your help!

evianpring, asked May 21 '15



2 Answers

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

By default, cross_val_score will not shuffle your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has shuffle and random_state parameters:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

So does train_test_split, which shuffles by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

Patterns like the one you describe are usually the result of a lack of randomness in the train/test split.
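
For example, one way to get shuffled folds is to build the iterator yourself and pass it as cv. This is a minimal, untested sketch, reusing the X, y, and hyperparameters from the question with the old sklearn.cross_validation API:

from sklearn import ensemble
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# build a stratified 2-fold iterator that shuffles the rows before
# splitting; random_state pins the shuffle so runs are reproducible
cv = StratifiedKFold(y, n_folds=2, shuffle=True, random_state=0)

model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5,
                                        max_features=3, oob_score=False)
result = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)

With the folds shuffled, the scores should no longer show the fixed first-fold-higher pattern you observed.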

KCzar, answered Oct 07 '22


The answer is what @KCzar pointed out. I just want to note that the easiest way I found to shuffle the data (X and y with the same index permutation) is the following:

p = np.random.permutation(len(X))  # one permutation, applied to both X and y
X, y = X[p], y[p]                  # assumes X and y are numpy arrays

source: Better way to shuffle two numpy arrays in unison
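
Note that the snippet above assumes numpy arrays; the X and y in the question are pandas objects, so positional indexing goes through .iloc instead. A sketch, assuming a fixed seed purely for reproducibility:

import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the shuffle is repeatable
p = rng.permutation(len(X))      # one permutation shared by X and y
X = X.iloc[p].reset_index(drop=True)
y = y.iloc[p].reset_index(drop=True)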

Sajad.sni, answered Oct 07 '22