Scikit-learn balanced subsampling

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and may overlap, since I feed each one to a separate classifier in a very large ensemble of classifiers.

Weka has a tool called SpreadSubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting but that's not what I'm looking for.)

mikkom asked May 04 '14 11:05


2 Answers

There is now a full-fledged Python package for handling imbalanced data, imbalanced-learn. It is available as a scikit-learn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

eickenberg answered Oct 17 '22 00:10


Here is my first version, which seems to work fine. Feel free to copy it or suggest how it could be made more efficient (I have long experience with programming in general, but not much with Python or NumPy).

This function creates a single random balanced subsample.

Edit: subsample_size is currently applied to the size of the smallest class, so values below 1 also sample down the minority class; this should probably be changed.

    import numpy as np

    def balanced_subsample(x, y, subsample_size=1.0):
        # Group the rows of x by class label and track the smallest class size.
        class_xs = []
        min_elems = None
        for yi in np.unique(y):
            elems = x[(y == yi)]
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]

        # Every class contributes use_elems rows.
        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems * subsample_size)

        xs = []
        ys = []
        for ci, this_xs in class_xs:
            # The boolean indexing above returned a copy, so shuffling in place is safe.
            if len(this_xs) > use_elems:
                np.random.shuffle(this_xs)

            x_ = this_xs[:use_elems]
            y_ = np.full(use_elems, ci)  # keeps the original label dtype

            xs.append(x_)
            ys.append(y_)

        xs = np.concatenate(xs)
        ys = np.concatenate(ys)

        return xs, ys
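The question asks for N subsamples that are allowed to overlap. One possible way to get them is to draw row indices instead of copying data, so each classifier in the ensemble can slice its own view. The following is only a sketch under that assumption: the names balanced_indices and balanced_subsamples are hypothetical, and it downsamples every class to the size of the rarest one unless n_per_class is given.

```python
import numpy as np

def balanced_indices(y, n_per_class=None, rng=None):
    """Return shuffled row indices of one balanced random subsample of y."""
    rng = np.random.default_rng(rng)  # accepts a seed or an existing Generator
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Default: cut every class down to the size of the rarest class.
    if n_per_class is None:
        n_per_class = int(counts.min())
    # Sampling is without replacement, so n_per_class must not exceed
    # the size of the rarest class (rng.choice raises ValueError otherwise).
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return rng.permutation(np.concatenate(picked))

def balanced_subsamples(y, n_subsamples, n_per_class=None, seed=0):
    """Return a list of index arrays; different subsamples may overlap."""
    rng = np.random.default_rng(seed)
    return [balanced_indices(y, n_per_class, rng) for _ in range(n_subsamples)]

# Toy data: 90 rows of class 0 and 10 rows of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(200).reshape(100, 2)

for idx in balanced_subsamples(y, n_subsamples=3):
    X_sub, y_sub = X[idx], y[idx]  # feed these to one classifier
```

Working with indices keeps memory use low for a large dataset, since only the selected rows are materialized when a subsample is actually used.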

To make the above work with a pandas DataFrame, a couple of changes are needed:

  1. Replace the np.random.shuffle line with

    this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

  2. Replace the np.concatenate lines with

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys),name='target')
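If the features and the label live in the same DataFrame, a shorter route may be to let pandas do the per-class sampling. This is a sketch rather than a drop-in replacement for the function above: it assumes pandas 1.1 or newer (for GroupBy.sample), and the function name balanced_subsample_df and the column name "target" are illustrative.

```python
import pandas as pd

def balanced_subsample_df(df, target, random_state=None):
    """Downsample every class in df[target] to the size of the rarest class."""
    n = int(df[target].value_counts().min())
    # GroupBy.sample draws n rows per class without replacement
    # and keeps the original index labels.
    return df.groupby(target).sample(n=n, random_state=random_state)

# Toy frame: 90 rows of class 0 and 10 rows of class 1.
df = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})
sub = balanced_subsample_df(df, "target", random_state=0)
```

Calling the function repeatedly with different random_state values yields several overlapping balanced subsamples.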

mikkom answered Oct 17 '22 00:10