
Discrepancy in Scikit Learn Stratified Cross Validation

I am seeing a discrepancy in classification performance between two cross-validation techniques applied to the same data. Can anyone shed some light on this?

  • Method 1: cross_validation.train_test_split
  • Method 2: StratifiedKFold.

Two examples with the same data set

Data set: 5500 samples (Class 1 = 500; Class 0 = 5000) by 193 features

Method 1 [ Random Iteration with train_test_split ]

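# Imports for scikit-learn 0.16.x, where the splitters live in sklearn.cross_validation
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve

# X_train and y_train are the asker's data (not shown here)
for i in range(0, 5):
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(X_train.values, y_train, test_size=0.2, random_state=i)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=0, oob_score=True)
    clf.fit(X_tr, y_tr)  # fit once, then reuse the model for labels and probabilities
    y_score = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)
    cm = confusion_matrix(y_te, y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %s" % roc_auc)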

Result of method 1

Iteration 1 ROC AUC:  0.91
[[998   4]
 [ 42  56]]

Iteration 5 ROC AUC:  0.88
[[1000    3]
 [  35   62]]

Method 2 [ StratifiedKFold cross validation ]

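# Imports for scikit-learn 0.16.x, where the splitters live in sklearn.cross_validation
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve

# X_train and y_train are the asker's data (not shown here)
cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
clf = RandomForestClassifier(n_estimators=250, max_depth=None, min_samples_split=1, random_state=None, oob_score=True)
for train, test in cv:
    clf.fit(X_train.values[train], y_train[train])  # fit once per fold
    y_score = clf.predict(X_train.values[test])
    y_prob = clf.predict_proba(X_train.values[test])
    cm = confusion_matrix(y_train[test], y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %s" % roc_auc)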

Result of method 2

Fold 1 ROC AUC:  0.76
Fold 1 Confusion Matrix
[[995   5]
 [ 92   8]]

Fold 5 ROC AUC:  0.77
Fold 5 Confusion Matrix
[[986  14]
 [ 76  23]]
Mamun Rashid asked Jan 26 '26 08:01


1 Answer

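train_test_split is not stratified, so the class ratio in its test sets is not guaranteed to match the full data set, whereas StratifiedKFold preserves it in every fold. In the current development version and in the upcoming 0.17 release you can pass stratify=y to train_test_split to make it stratified, but not in 0.16.2. Also, the random states are handled differently in the two methods (Method 1 fixes random_state for both the split and the forest, Method 2 leaves both as None), so you can't expect exactly the same outcome.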
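As a rough illustration of the point about stratification, the sketch below compares the two splitters once both are stratified and seeded. It assumes a newer scikit-learn (0.18 or later), where both live in sklearn.model_selection, and it uses random placeholder data with the same shape and class counts as the question rather than the asker's real data. The names holdout_positives and fold_positives are made up for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder data with the question's shape: 5000 negatives, 500 positives,
# 193 features. The feature values are random and carry no signal.
rng = np.random.RandomState(0)
X = rng.rand(5500, 193)
y = np.array([0] * 5000 + [1] * 500)

# Stratified hold-out split: stratify=y keeps the class ratio in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
holdout_positives = int(y_te.sum())

# Stratified 5-fold split, shuffled with a fixed seed so it is repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positives = [int(y[test].sum()) for _, test in cv.split(X, y)]

print("positives in hold-out test set:", holdout_positives)
print("positives per CV test fold:", fold_positives)
```

With 500 positives and a 20% test share, every test set here should hold 100 positives. Note that shuffle=True is a choice made for this sketch: the question uses shuffle=False, which builds folds from the rows in their stored order, so if the rows are sorted in some way the folds may differ from random splits for that reason as well.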

Andreas Mueller answered Jan 29 '26 05:01


