Discrepancy in Scikit Learn Stratified Cross Validation

Question

I am seeing a discrepancy in classification performance between two cross-validation techniques applied to the same data. Can anyone shed some light on this?

Method 1: cross_validation.train_test_split
Method 2: StratifiedKFold.

Two Examples with same data set

Data set: 5500 samples (Class 1 = 500; Class 0 = 5000) by 193 features

Method 1 [ Random Iteration with train_test_split ]

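# scikit-learn 0.16-era API, as used in this thread
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc

for i in range(0, 5):
    # Random 80/20 split with a different seed on each iteration
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(X_train.values, y_train, test_size=0.2, random_state=i)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=0, oob_score=True)
    clf.fit(X_tr, y_tr)  # fit once, then reuse the model for both predictions
    y_score = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)
    cm = confusion_matrix(y_te, y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)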

Result of method 1

Iteration 1 ROC AUC:  0.91
[[998   4]
 [ 42  56]]

Iteration 5 ROC AUC:  0.88
[[1000    3]
 [  35   62]]

Method 2 [ StratifiedKFold cross validation ]

from sklearn.cross_validation import StratifiedKFold

# Five stratified folds taken in row order (no shuffling)
cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=None, oob_score=True)
for train, test in cv:
    # Fit once per fold so both predictions come from the same forest
    clf.fit(X_train.values[train], y_train[train])
    y_score = clf.predict(X_train.values[test])
    y_prob = clf.predict_proba(X_train.values[test])
    cm = confusion_matrix(y_train[test], y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)

Result of method 2

Fold 1 ROC AUC:  0.76
Fold 1 Confusion Matrix
[[995   5]
 [ 92   8]]

Fold 5 ROC AUC:  0.77
Fold 5 Confusion Matrix
[[986  14]
 [ 76  23]]
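
To make the difference between the two sets of results easier to see, the confusion matrices above can be reduced to per-class test counts and the recall on class 1. This is only a small sketch in plain Python: the helper name summarize and the dictionary labels are made up for illustration, and it assumes scikit-learn's usual layout where rows are true classes and columns are predicted classes.

```python
# Confusion matrices copied from the results above.
# Assumed layout (scikit-learn convention): rows = true class, columns = predicted class.
results = {
    "split, iteration 1": [[998, 4], [42, 56]],
    "split, iteration 5": [[1000, 3], [35, 62]],
    "k-fold, fold 1": [[995, 5], [92, 8]],
    "k-fold, fold 5": [[986, 14], [76, 23]],
}


def summarize(cm):
    """Return (class-0 count, class-1 count, recall on class 1) for a 2x2 matrix."""
    negatives = sum(cm[0])  # all test samples whose true class is 0
    positives = sum(cm[1])  # all test samples whose true class is 1
    recall = cm[1][1] / float(positives)  # true positives / all true class-1 samples
    return negatives, positives, recall


summary = {name: summarize(cm) for name, cm in results.items()}
for name in sorted(summary):
    neg, pos, recall = summary[name]
    print("%-20s class 0: %4d  class 1: %3d  class-1 recall: %.2f" % (name, neg, pos, recall))
```

The row sums show that the number of class 1 samples in the test set changes between train_test_split iterations (98 and 97 here), while the recall on class 1 is where the two methods differ most.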

Andreas Mueller · Accepted Answer

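train_test_split is not stratified. In the current development version and in the upcoming 0.17 you can pass stratify=y to make it stratified, but not in 0.16.2. Also, the random states are not fixed and are used differently in the two methods (Method 1 seeds both the split and the forest, while Method 2 leaves the forest unseeded and does not shuffle the folds), so you can't expect exactly the same outcome.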
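
As a rough illustration of both points, the sketch below compares a plain and a stratified train_test_split, and builds shuffled, seeded stratified folds. It assumes a newer scikit-learn than the 0.16.2 discussed here (0.18 or later, where these tools live in sklearn.model_selection). The data is synthetic and only mimics the 5000/500 class balance from the question, and the helper name positives_in_test is made up for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the question's data: 5000 negatives, 500 positives.
# The features are random placeholders; only the label counts matter here.
rng = np.random.RandomState(0)
y = np.array([0] * 5000 + [1] * 500)
X = rng.rand(len(y), 3)


def positives_in_test(stratify):
    """Count class-1 samples in the 20% test split for seeds 0..4."""
    counts = []
    for seed in range(5):
        _, _, _, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=stratify
        )
        counts.append(int(y_te.sum()))
    return counts


# Without stratification the class-1 count depends on the seed.
plain_counts = positives_in_test(stratify=None)
# With stratify=y each test split keeps the overall class ratio.
stratified_counts = positives_in_test(stratify=y)

# Shuffled, seeded stratified folds: reproducible and independent of row order.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [int(y[test].sum()) for _, test in cv.split(X, y)]

print("plain split:     ", plain_counts)
print("stratified split:", stratified_counts)
print("stratified folds:", fold_counts)
```

Fixing random_state on both the splitter and the classifier is what makes repeated runs comparable. Whether shuffling the folds closes the gap seen in the question depends on how the original rows are ordered, which the thread does not say.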

Tags: scikit-learn, random-forest, cross-validation

Mamun Rashid
