How do I generate random folds for cross-validation in scikit-learn?
Imagine we have 20 samples of one class and 80 of the other, and we need to generate N train and test sets, each training set of size 30, under the constraint that each training set contains 50% of class 1 and 50% of class 2.
I found this discussion (https://github.com/scikit-learn/scikit-learn/issues/1362), but I don't understand how to get the folds. Ideally, I think I need a function like this:
cfolds = np.cross_validation.imaginaryfunction(
[list(itertools.repeat(1,20)), list(itertools.repeat(2,80))],
n_iter=100, test_size=0.70)
What am I missing?
There is no direct way to do cross-validation with undersampling in scikit-learn, but there are two workarounds:
1. Use stratified cross-validation (StratifiedKFold, or StratifiedShuffleSplit for random splits) so that the class distribution in each fold mirrors the distribution of the data, then reduce the imbalance in the classifier via the class_weight parameter. It accepts either 'auto' (replaced by 'balanced' in later releases), which weights classes inversely proportional to their counts, or a dictionary with explicit weights. Note that this reweights the classes rather than actually undersampling or oversampling them.
2. Write your own cross-validation routine, which should be pretty straightforward using pandas.
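A minimal sketch of the second workaround, under a few assumptions: it uses plain NumPy index arrays rather than pandas, and balanced_undersampled_splits is a hypothetical helper name, not a scikit-learn function. It draws the same number of samples from each class for every training set and puts everything else in the test set.

```python
import numpy as np


def balanced_undersampled_splits(y, n_splits=100, train_size=30, seed=0):
    """Yield (train_idx, test_idx) pairs with equally represented classes in train.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    y = np.asarray(y)
    classes = np.unique(y)
    per_class, remainder = divmod(train_size, len(classes))
    if remainder:
        raise ValueError("train_size must be divisible by the number of classes")
    smallest = min(int(np.sum(y == c)) for c in classes)
    if per_class > smallest:
        raise ValueError("not enough samples in the smallest class")

    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(y))
    for _ in range(n_splits):
        # Draw the same number of samples from every class, without replacement.
        train_idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
            for c in classes
        ])
        # Everything not drawn for training becomes the test set.
        test_idx = np.setdiff1d(all_idx, train_idx)
        yield train_idx, test_idx


# 20 samples of class 1 and 80 of class 2, as in the question.
y = np.array([1] * 20 + [2] * 80)
splits = list(balanced_undersampled_splits(y, n_splits=100, train_size=30))
train_idx, test_idx = splits[0]
print(np.bincount(y[train_idx])[1:], len(test_idx))
```

Each test set here keeps the remaining 70 samples, so it stays imbalanced; only the training sets are balanced. Since scikit-learn's cross-validation utilities accept an iterable of (train, test) index pairs as the cv argument, a list like splits should be usable there directly.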