
Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run a random forest on the dataset, using cross validation. I do this in two ways:

  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I am getting different results when I do what I believe is essentially the same thing. To illustrate, I run a two-fold cross validation using the two methods above, as in the code below.

import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data: columns 0-18 are the features, column 19 is the 0/1 label
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:19]
y = data.iloc[:, 19]

depth = 5
maxFeat = 3 

result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

print result
# result is now something like array([ 0.66773295,  0.58824739])

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                          max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:, 1])   # probability of the positive class
print auc    # something like 0.83

# swap the halves: train on the second half, score on the first
RFModel.fit(xtest, ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1])
print auc    # also something like 0.83

My question is:

why am I getting different results? That is, why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say 10), there appears to be a pattern in my results: the first fold always gives me the highest AUC.

In the two-fold cross validation example above, the first AUC is always higher than the second; it's always something like 0.70 versus 0.58.

Thanks for your help!

evianpring, asked May 21 '15



2 Answers

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

By default, cross_val_score will not shuffle your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has shuffle and random_state parameters:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

So does train_test_split, which shuffles by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

Patterns like the one you describe are usually the result of a lack of randomness in the train/test split.
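
For example, one way to get shuffled folds is to build the iterator yourself and pass it as cv. This is a minimal, untested sketch, reusing the X, y, and hyperparameters from the question with the old sklearn.cross_validation API:

from sklearn import ensemble
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# build a stratified 2-fold iterator that shuffles the rows before
# splitting; random_state pins the shuffle so runs are reproducible
cv = StratifiedKFold(y, n_folds=2, shuffle=True, random_state=0)

model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5,
                                        max_features=3, oob_score=False)
result = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)

With the folds shuffled, the scores should no longer show the fixed first-fold-higher pattern you observed.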

KCzar, answered Oct 07 '22


The answer is what @KCzar pointed out. I just want to note that the easiest way I found to shuffle the data (X and y with the same index permutation) is the following:

p = np.random.permutation(len(X))  # one permutation, applied to both X and y
X, y = X[p], y[p]                  # assumes X and y are numpy arrays

source: Better way to shuffle two numpy arrays in unison
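
Note that the snippet above assumes numpy arrays; the X and y in the question are pandas objects, so positional indexing goes through .iloc instead. A sketch, assuming a fixed seed purely for reproducibility:

import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the shuffle is repeatable
p = rng.permutation(len(X))      # one permutation shared by X and y
X = X.iloc[p].reset_index(drop=True)
y = y.iloc[p].reset_index(drop=True)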

Sajad.sni, answered Oct 07 '22