I have almost 900,000 rows of data that I want to run through scikit-learn's Random Forest Classifier. The problem is that when I try to fit the model, my computer freezes completely, so what I want to try is fitting the model 50,000 rows at a time, but I'm not sure if this is possible.
So the code I have now is:

# This code freezes my computer
rfc.fit(X, Y)

# What I want instead is something like this
# (using .iloc for positional slicing; .ix is deprecated and removed in modern pandas)
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on
A random forest splits nodes by selecting features at random. The final prediction is decided by the outcomes of the individual trees: the class chosen by the most decision trees becomes the final answer (majority voting).

If you want to increase the accuracy of your model, increase the number of trees (n_estimators). You can also specify the maximum number of features to consider at each split (max_features); the right value depends heavily on your dataset. If your independent variables are highly correlated, you'll want to decrease the maximum number of features.
As the scikit-learn documentation puts it: "A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting."
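In code, that tuning might look like this (a minimal sketch; the concrete values are illustrative, not recommendations for this dataset):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=200,     # more trees generally improves accuracy, at the cost of time and memory
    max_features="sqrt",  # features considered per split; lower this if your features are highly correlated
)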
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1, or a combination of all three. So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
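For example (a minimal sketch; X and Y are assumed to be your full feature matrix and labels):

import sklearn
print(sklearn.__version__)  # check which version you are actually running

from sklearn.ensemble import RandomForestClassifier

# n_jobs=1 disables multiprocessing, which avoids the freezes that
# older scikit-learn versions were prone to on Windows with n_jobs=-1
rfc = RandomForestClassifier(n_estimators=100, n_jobs=1)
rfc.fit(X, Y)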
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])

# Add another 100 estimators, trained on chunk 2 only
clf.set_params(n_estimators=200)
clf.fit(X.iloc[50000:100000], Y.iloc[50000:100000])

# And so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[100000:150000], Y.iloc[100000:150000])
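Note that with warm_start=True, each subsequent fit call keeps the trees that were already grown and only trains the newly added estimators, so each batch of 100 trees only ever sees the chunk it was fitted on.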
Another possibility is to fit a new classifier on each chunk and then either simply average the predictions from all classifiers or merge the trees into one big random forest, as described here.
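A minimal sketch of the averaging variant (the chunk boundaries and the X_test name are illustrative, not from the original post):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

chunks = [(0, 50000), (50000, 100000), (100000, 150000)]  # illustrative
classifiers = []
for start, stop in chunks:
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X.iloc[start:stop], Y.iloc[start:stop])
    classifiers.append(clf)

# Average the class-probability estimates and pick the most likely class.
# Assumes every chunk contains all classes, so classes_ line up across classifiers.
proba = np.mean([clf.predict_proba(X_test) for clf in classifiers], axis=0)
predictions = classifiers[0].classes_[np.argmax(proba, axis=1)]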
Another method, similar to the one linked in Andreus' answer, is to grow the trees in the forest individually. I did this a while back: basically, I trained a number of DecisionTreeClassifiers one at a time on different partitions of the training data. I saved each model via pickling, then loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
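Very roughly, that looks like the following (a hedged sketch: the partitioning is illustrative, and the exact set of attributes the forest's predict path needs has changed across scikit-learn versions, so verify against the version you run):

import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Train one tree per partition and pickle it (one script run per tree if memory is tight)
for i, (start, stop) in enumerate([(0, 50000), (50000, 100000)]):  # illustrative
    tree = DecisionTreeClassifier(max_features="sqrt")  # inject per-split randomness, like a forest does
    tree.fit(X.iloc[start:stop], Y.iloc[start:stop])
    with open("tree_%d.pkl" % i, "wb") as f:
        pickle.dump(tree, f)

# Later: load the trees and graft them onto a forest object
trees = []
for i in range(2):
    with open("tree_%d.pkl" % i, "rb") as f:
        trees.append(pickle.load(f))

forest = RandomForestClassifier(n_estimators=len(trees))
forest.estimators_ = trees
# Attributes predict() relies on; assumes every partition contained all classes
forest.classes_ = trees[0].classes_
forest.n_classes_ = trees[0].n_classes_
forest.n_outputs_ = trees[0].n_outputs_
forest.n_features_in_ = trees[0].n_features_in_  # newer versions also check input width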
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a work-around I posted in the linked question.