
Using Python generators in scikit-learn [closed]

I was wondering whether, and how, it is possible to use a Python generator as the data input to the .fit() method of a scikit-learn classifier. Given the huge amount of data involved, this seems to make sense to me.

In particular, I am about to implement a random forest approach.

Regards K

Krn asked Apr 21 '26 03:04



1 Answer

The answer is "no". To do out-of-core learning with random forests, you should:

  1. Split your data into reasonably sized batches (limited by the amount of RAM you have; bigger is better);
  2. Train a separate random forest on each batch;
  3. Append all the underlying trees together in the estimators_ member of one of the forests (untested):

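    # Merge the trees of every other forest into the first one (untested).
    for forest in forests[1:]:
        forests[0].estimators_.extend(forest.estimators_)
    forests[0].n_estimators = len(forests[0].estimators_)  # keep the count consistent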
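
The three steps above could be wired to a generator roughly as follows. This is only a sketch under assumptions: `iter_batches`, `fit_and_merge`, and `make_forest` are hypothetical names, the generator is assumed to yield `(features, label)` pairs, and `make_forest` is assumed to return a fresh, unfitted forest that exposes `fit`, `estimators_`, and `n_estimators`. For classifiers, every batch should probably contain all classes, since forests fitted on different label sets would not merge cleanly.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable or generator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def fit_and_merge(rows, batch_size, make_forest):
    """Fit one forest per batch, then collect all trees in the first forest.

    rows        -- iterable of (features, label) pairs (assumed layout)
    make_forest -- hypothetical factory returning a new, unfitted forest
    """
    merged = None
    for batch in iter_batches(rows, batch_size):
        X = [features for features, _ in batch]
        y = [label for _, label in batch]
        forest = make_forest().fit(X, y)
        if merged is None:
            merged = forest
        else:
            merged.estimators_.extend(forest.estimators_)
    if merged is not None:
        # Keep the tree count in sync with the merged list.
        merged.n_estimators = len(merged.estimators_)
    return merged
```

With scikit-learn installed, `make_forest` could be something like `lambda: RandomForestClassifier(n_estimators=50)`. Only one batch is held in memory at a time, which is the point of the exercise.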
    

(Yes, this is hacky, but no solution to this problem has been found yet. Note that with very large datasets, it might pay to just sample a number of training examples that fits in the RAM of a big machine instead of training on all of them. Another option is to switch to linear models trained with SGD; those implement a partial_fit method, but they are obviously limited in the kinds of functions they can learn.)
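
For the sampling option, one way to draw a fixed-size uniform sample from a generator of unknown length is reservoir sampling (Algorithm R). That technique is not named in the answer; it is suggested here as one possible approach, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Only k items are kept in memory, so rows can be a generator over a
    dataset far larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

The resulting list fits in memory by construction, so it can be split into features and labels and passed to an ordinary `.fit()` call.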

Fred Foo answered Apr 22 '26 16:04



