I am seeing a discrepancy in classification performance between two cross-validation techniques applied to the same data. Can anyone shed some light on this?
Data set: 5500 samples (Class 1 = 500; Class 0 = 5000) by 193 features.
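from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Approach 1: five random 80/20 splits via train_test_split
for i in range(0, 5):
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(X_train.values, y_train, test_size=0.2, random_state=i)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=0, oob_score=True)
    clf.fit(X_tr, y_tr)  # fit once, then reuse the model for both outputs
    y_score = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)
    cm = confusion_matrix(y_te, y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)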
Iteration 1 ROC AUC: 0.91
[[998 4]
[ 42 56]]
Iteration 5 ROC AUC: 0.88
[[1000 3]
[ 35 62]]
from sklearn.cross_validation import StratifiedKFold

# Approach 2: stratified 5-fold cross-validation without shuffling
cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=None, oob_score=True)
for train, test in cv:
    # fit once per fold so predictions and probabilities come from the same forest
    clf.fit(X_train.values[train], y_train[train])
    y_score = clf.predict(X_train.values[test])
    y_prob = clf.predict_proba(X_train.values[test])
    cm = confusion_matrix(y_train[test], y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)
Fold 1 ROC AUC: 0.76
Fold 1 Confusion Matrix
[[995 5]
[ 92 8]]
Fold 5 ROC AUC: 0.77
Fold 5 Confusion Matrix
[[986 14]
[ 76 23]]
train_test_split is not stratified. In the current development version and in the upcoming 0.17 release you can pass stratify=y to make it stratified, but not in 0.16.2.
Also, the random states are not fixed and are used differently in the two approaches, so you can't expect exactly the same outcome.
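To make the stratification point concrete, here is a minimal standard-library sketch, not scikit-learn's implementation. The helper names plain_split and stratified_split are made up for illustration, and the labels simply mimic the 5000/500 class balance from the question. It only shows how the number of positives in the test set behaves under each kind of split; it does not by itself account for the AUC gap.

```python
import random
from collections import Counter


def plain_split(labels, test_fraction, seed):
    """Shuffle all indices together and cut off a test set (not stratified)."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(round(len(labels) * test_fraction))
    return indices[n_test:], indices[:n_test]


def stratified_split(labels, test_fraction, seed):
    """Shuffle and cut each class separately so both sides keep the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical labels with the same class balance as in the question.
labels = [0] * 5000 + [1] * 500

for seed in range(5):
    _, plain_test = plain_split(labels, 0.2, seed)
    _, strat_test = stratified_split(labels, 0.2, seed)
    plain_pos = Counter(labels[i] for i in plain_test)[1]
    strat_pos = Counter(labels[i] for i in strat_test)[1]
    print("seed", seed, "plain positives:", plain_pos, "stratified positives:", strat_pos)
```

With these labels the stratified test set always holds 100 positives out of 1100 samples, while the plain split lets that count vary with the seed. That matches the confusion matrices in the question, where the train_test_split runs show 98 and 97 positives in the test set and the StratifiedKFold folds show 100 and 99.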