I have a matrix with 20 columns. The last column contains 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:
sklearn.cross_validation.cross_val_score
sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:19]  # columns 0-18 are the 19 features (0:18 would drop column 18)
y = data.iloc[:,19]
depth = 5
maxFeat = 3
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False), X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295, 0.58824739])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print(auc) #something like 0.83
RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print(auc) #also something like 0.83
My question is:
Why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?
Note: When I use more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
Thanks for your help!
cross_val_score runs single-metric cross validation, whilst cross_validate runs multi-metric cross validation. This means that cross_val_score accepts only a single metric and returns its value for each fold, whilst cross_validate accepts a list of metrics and returns all of them for each fold.
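To make the difference concrete, here is a minimal sketch. It assumes a newer scikit-learn release where these functions live in sklearn.model_selection (the sklearn.cross_validation module used in the question was later removed), and it uses synthetic data from make_classification as a stand-in for the asker's file. The names X_demo, y_demo, single_scores and multi_scores are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic stand-in data (assumption): 200 rows, 19 features, binary labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=5, max_features=3, random_state=0)

# Single metric: one score per fold, returned as a NumPy array.
single_scores = cross_val_score(model, X_demo, y_demo, scoring='roc_auc', cv=2)

# Multiple metrics: a dict holding one 'test_<metric>' array per metric.
multi_scores = cross_validate(model, X_demo, y_demo, scoring=['roc_auc', 'accuracy'], cv=2)

print(single_scores)
print(multi_scores['test_roc_auc'], multi_scores['test_accuracy'])
```

The dict returned by cross_validate also carries fit_time and score_time entries, which is handy when comparing how expensive different settings are.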
cross_validation.train_test_split: a quick utility that wraps calls to check_arrays and next(iter(ShuffleSplit(n_samples))) and applies them to the input data in a single call, for splitting (and optionally subsampling) data in a one-liner. Python lists or tuples occurring in arrays are converted to 1D numpy arrays.
The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.
A classifier's score() method returns the mean accuracy. cross_val_score is often used to compare one RandomForestClassifier model with some hyperparameters against another with different hyperparameters, so that you can select the best.
When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't in random order to begin with.
The KFold iterator has a random state parameter:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like the one you described are usually the result of a lack of randomness in the train/test sets.
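As a rough sketch of that suggestion: the snippet below assumes a newer scikit-learn release where the splitters live in sklearn.model_selection, and it runs on synthetic data rather than data_so.csv, so the scores it prints say nothing about the asker's dataset. X_demo, y_demo and the fixed seeds are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption), not the asker's data_so.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=0)

# Without shuffling, the first test fold is simply the first half of the rows.
_, first_test_plain = next(KFold(n_splits=2).split(X_demo))
# With shuffling, the test rows are drawn from all over the dataset.
_, first_test_shuffled = next(KFold(n_splits=2, shuffle=True, random_state=0).split(X_demo))
print(first_test_plain[:5], first_test_shuffled[:5])

# Pass a shuffling splitter as cv instead of the bare integer cv=2.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=5, max_features=3, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, scoring='roc_auc', cv=cv)
print(scores)
```

If the rows of the real file are ordered in some meaningful way (by date, by source, by label), the unshuffled folds will each see a different slice of that ordering, which would explain a fold-dependent pattern in the AUC.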
The answer is what @KCzar pointed out. I just want to note that the easiest way I found to randomize the data (X and y with the same index shuffling) is the following:
p = np.random.permutation(len(X))
# X and y are pandas objects in the question, so select rows by position with .iloc
X, y = X.iloc[p], y.iloc[p]
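Here is a small self-contained sketch of the same idea, to check that rows and labels stay paired after the shuffle. Since X and y in the question are pandas objects, it selects rows with .iloc; the tiny dataset, the names X_demo, y_demo, X_shuf, y_shuf and the fixed seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset (assumption) where each label equals its row's first feature.
X_demo = pd.DataFrame({'f0': [0, 1, 2, 3, 4], 'f1': [10, 11, 12, 13, 14]})
y_demo = pd.Series([0, 1, 2, 3, 4])

rng = np.random.RandomState(0)      # fixed seed so the shuffle is repeatable
p = rng.permutation(len(X_demo))    # one permutation shared by X and y

# .iloc selects rows by position; plain X_demo[p] would try to select columns.
X_shuf, y_shuf = X_demo.iloc[p], y_demo.iloc[p]

# Check that every shuffled row still sits next to its own label.
print((X_shuf['f0'].values == y_shuf.values).all())
```

With plain NumPy arrays, X[p] and y[p] do the same job, because integer-array indexing on an ndarray picks rows.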