Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of data that I want to run through scikit-learn's Random Forest Classifier. The problem is that when I try to fit the model, my computer freezes completely, so what I want to try instead is fitting the model 50,000 rows at a time, but I'm not sure if this is possible.

So the code I have now is:

# This freezes my computer: fitting all ~900,000 rows at once
rfc.fit(X, Y)

# What I want instead: fit the model in pieces
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on
asked Jun 09 '15 by grasshopper


2 Answers

Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of this writing), that you're on a Windows machine, and that you're using n_jobs=-1 (or some combination of the three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
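
For instance, a quick first check along those lines might look like this (a sketch, not tested on your data; rfc, X, and Y are the objects from your question):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=1 keeps the fit in a single process, sidestepping the
# multiprocessing problems hinted at above
rfc = RandomForestClassifier(n_estimators=100, n_jobs=1)
rfc.fit(X, Y)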

If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:

from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])

# Add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])

# And so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])

Another possibility is to fit a new classifier on each chunk and then either average the predictions from all classifiers or merge the trees into one big random forest, as described here.
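
A minimal sketch of the averaging variant, assuming the 50,000-row chunking from your question (X_test is a hypothetical hold-out set, and this assumes every chunk contains every class so the probability columns line up across forests):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit one independent forest per 50,000-row chunk
forests = []
for start in range(0, len(X), 50000):
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X.iloc[start:start + 50000], Y.iloc[start:start + 50000])
    forests.append(clf)

# Average class probabilities across the forests and take the argmax
probs = np.mean([f.predict_proba(X_test) for f in forests], axis=0)
predictions = forests[0].classes_[np.argmax(probs, axis=1)]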

answered Oct 10 '22 by Matt


Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.

I did this a while back: basically, I trained a number of DecisionTreeClassifier instances one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list that I assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
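
A minimal sketch of that idea, assuming X and Y from the question and that every partition contains every class (exactly which forest attributes must be set by hand varies between scikit-learn versions, so treat this as a starting point rather than a drop-in solution):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Train one decision tree per 50,000-row partition
trees = []
for start in range(0, len(X), 50000):
    # max_features="sqrt" mimics the per-split feature sampling a forest does
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X.iloc[start:start + 50000], Y.iloc[start:start + 50000])
    trees.append(tree)

# Graft the trained trees onto a forest object by hand
forest = RandomForestClassifier(n_estimators=len(trees))
forest.estimators_ = trees
forest.classes_ = np.unique(Y)
forest.n_classes_ = len(forest.classes_)
forest.n_outputs_ = 1

# forest.predict(...) now votes over the individually trained trees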

I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a work-around I posted in the linked question.

answered Oct 10 '22 by Matt Hancock