
StratifiedKFold vs StratifiedShuffleSplit vs StratifiedKFold + Shuffle

What is the difference between StratifiedKFold, StratifiedShuffleSplit, and StratifiedKFold + shuffle? When should I use each one? Which gives a better accuracy score? Why don't I get similar results? My code and results are below. I am using Naive Bayes with 10x10 cross-validation.

    ####### StratifiedKFold (shuffle=True) #######
    # note: sklearn.cross_validation is the pre-0.18 API (now sklearn.model_selection)
    from sklearn import cross_validation
    from sklearn.cross_validation import StratifiedKFold

    for i in range(10):
        skf = StratifiedKFold(y, n_folds=10, shuffle=True)
        scoresSKF2 = cross_validation.cross_val_score(clf, x, y, cv=skf)
        print(scoresSKF2)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
        print("")

    [ 0.1750503   0.16834532  0.16417051  0.18205424  0.1625758   0.1750939
      0.15495808  0.1712963   0.17096494  0.16918166]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16297787  0.17956835  0.17309908  0.17686093  0.17239388  0.16093615
      0.16970223  0.16956019  0.15473776  0.17208358]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17102616  0.16719424  0.1733871   0.16560877  0.166041    0.16122508
      0.16767852  0.17042824  0.18719212  0.1677307 ]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17275079  0.16633094  0.16906682  0.17570687  0.17210511  0.15515747
      0.16594391  0.18113426  0.16285135  0.1746953 ]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.1764875   0.17035971  0.16186636  0.1644547   0.16632977  0.16469229
      0.17635155  0.17158565  0.17849899  0.17005223]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16815177  0.16863309  0.17309908  0.17368725  0.17152758  0.16093615
      0.17143683  0.17158565  0.16574906  0.16511898]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16786433  0.16690647  0.17309908  0.17022504  0.17066128  0.16613695
      0.17259324  0.17737269  0.16256158  0.17643645]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16297787  0.16402878  0.17684332  0.16791691  0.16950621  0.1716267
      0.18328997  0.16984954  0.15792524  0.17701683]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16958896  0.16633094  0.17165899  0.17080208  0.16026567  0.17538284
      0.17490604  0.16840278  0.17502173  0.16511898]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17275079  0.15625899  0.17713134  0.16762839  0.18278949  0.16729269
      0.16449841  0.17303241  0.16111272  0.1610563 ]
    Accuracy SKF_NB: 0.17 (+/- 0.02)


    ###### StratifiedKFold + shuffle ######
    from sklearn.utils import shuffle

    for i in range(10):
        X, y = shuffle(x, y, random_state=i)
        skf = StratifiedKFold(y, n_folds=10)
        scoresSKF2 = cross_validation.cross_val_score(clf, X, y, cv=skf)
        print(scoresSKF2)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
        print("")

    [ 0.16700201  0.15913669  0.16359447  0.17772649  0.17297141  0.16931523
      0.17172593  0.18576389  0.17125471  0.16134649]
    Accuracy SKF_NB: 0.17 (+/- 0.02)

    [ 0.02874389  0.02705036  0.02592166  0.02740912  0.02714409  0.02687085
      0.02891009  0.02922454  0.0260794   0.02814858]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.0221328   0.02848921  0.02361751  0.02942874  0.02598903  0.02947125
      0.02804279  0.02719907  0.02376123  0.02205456]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.02788158  0.02848921  0.03081797  0.03289094  0.02829916  0.03293846
      0.02862099  0.02633102  0.03245436  0.02843877]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02874389  0.0247482   0.02448157  0.02625505  0.02483396  0.02860445
      0.02948829  0.02604167  0.02665894  0.0275682 ]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.0221328   0.02705036  0.02476959  0.02510098  0.02454519  0.02687085
      0.02254987  0.02199074  0.02492031  0.02524666]
    Accuracy SKF_NB: 0.02 (+/- 0.00)

    [ 0.02615694  0.03079137  0.02102535  0.03029429  0.02252382  0.02889338
      0.02197167  0.02604167  0.02752825  0.02843877]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.02673182  0.02676259  0.03197005  0.03115984  0.02512273  0.03236059
      0.02688638  0.02372685  0.03216459  0.02698781]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.0258695   0.02964029  0.03081797  0.02740912  0.02916546  0.02976018
      0.02717548  0.02922454  0.02694871  0.0275682 ]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.03506755  0.0247482   0.02592166  0.02740912  0.02772163  0.02773765
      0.02948829  0.0234375   0.03332367  0.02118398]
    Accuracy SKF_NB: 0.03 (+/- 0.01)


    ###### StratifiedShuffleSplit ######
    from sklearn.cross_validation import StratifiedShuffleSplit

    for i in range(10):
        sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=0)
        scoresSSS = cross_validation.cross_val_score(clf, x, y, cv=sss)
        print(scoresSSS)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSSS.mean(), scoresSSS.std() * 2))
        print("")

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)
Asked Jun 04 '16 by Aizzaac




1 Answer

It's hard to say which one is better. The choice depends on your modeling strategy and goals, although there is a strong preference in the community for K-fold cross-validation for both model selection and performance estimation. I will try to give you some intuition for the two main concepts that should guide your choice of sampling technique: stratification, and cross-validation vs. random split.

Also keep in mind that you can use these sampling techniques for two very different goals: model selection and performance estimation.

Stratification keeps the balance (ratio) between the labels/targets of the dataset. So if your whole dataset has two labels (e.g. positive and negative) with a 30/70 ratio, and you split it into 10 subsamples, each stratified subsample should keep that same ratio. Reasoning: because model performance is in general very sensitive to class balance, this strategy tends to make the performance estimates more stable across subsamples.
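
As a rough illustration (a pure-Python sketch, not the sklearn implementation): a stratified splitter can group indices by label, shuffle within each group, and deal them out across folds so every fold inherits the original label ratio. The `stratified_folds` helper below is hypothetical, for illustration only:

```python
from collections import Counter
import random

def stratified_folds(labels, k, seed=0):
    """Group indices by label, shuffle within each group, then deal them
    round-robin across k folds so every fold keeps the label ratio."""
    rng = random.Random(seed)
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_label.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# 30/70 label ratio, as in the example above
labels = ["pos"] * 30 + ["neg"] * 70
for fold in stratified_folds(labels, 10):
    print(Counter(labels[i] for i in fold))   # every fold: 3 pos, 7 neg
```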

Split vs. random split. A split is just a split, usually for the purpose of having separate training and testing subsamples. But taking the first X% of the data as one subsample and the remainder as the other can be a bad idea, because it can introduce very high bias, for example when the data is ordered by class or by time. That is where a random split comes into play: it introduces randomness into the subsampling.
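
To make that bias concrete, here is a small sketch: if the data happens to be ordered by class, taking the first 80% as the training set sees only one class, while a (seeded) random split draws from the whole dataset:

```python
import random

labels = [0] * 80 + [1] * 20           # data that happens to be ordered by class
cut = int(0.8 * len(labels))

# naive split: first 80% for training -> the training set never sees class 1
train_naive = labels[:cut]
print(set(train_naive))                # {0}

# random split: shuffle the indices first, then cut
idx = list(range(len(labels)))
random.Random(0).shuffle(idx)
train_rand = [labels[i] for i in idx[:cut]]
print(set(train_rand))                 # both classes present
```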

K-fold cross-validation vs. random split. K-fold consists of creating K subsamples. Because you now have more than 2 subsamples, you can hold one out for testing and use the remaining K-1 for training, repeat this for every possible choice of testing fold, and average the results. This is known as cross-validation. Doing K-fold cross-validation is like doing a (non-random) split K times and averaging. A small sample might not benefit much from K-fold cross-validation, while a large sample usually does. A random split is a more efficient (faster) way of estimating, but it may be more prone to sampling bias than K-fold cross-validation. Combining stratification with a random split is an attempt to get an effective and efficient sampling strategy that preserves the label distribution.
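
A bare-bones sketch of the K-fold procedure itself (no stratification or shuffling), with a toy mean-predictor standing in for a real model; `kfold_indices` is a made-up helper, not a sklearn function:

```python
def kfold_indices(n, k):
    """Yield (test_idx, train_idx) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield test, train
        start += size

# toy "model": predict the mean of the training targets; score each fold by MSE
targets = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = []
for test, train in kfold_indices(len(targets), 5):
    pred = sum(targets[i] for i in train) / len(train)
    scores.append(sum((targets[i] - pred) ** 2 for i in test) / len(test))
print(sum(scores) / len(scores))   # average of the 5 fold scores -> 12.75
```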

Answered Oct 12 '22 by Rabbit