I am training a Random Forest classifier in Python with scikit-learn on a corpus of image data. Because I am performing image segmentation I have to store the data for every pixel, which ends up being a huge matrix of roughly 100,000,000 data points. Running an RF classifier on that matrix gives my computer a memory overflow error and takes forever to run.
One idea I had was to train the classifier on sequential small batches of the dataset, eventually covering the whole thing while improving the fit each time. Could this work, or will each call to fit simply override the previous fit?
You can use warm_start to grow the forest incrementally, adding new trees fitted on each batch while keeping the trees already built:
from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)
# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
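As a rough sketch of how that extends to many batches (the iter_batches helper below is just an assumed placeholder for however you chunk your pixel matrix), you can keep growing the forest in a loop:

from sklearn.ensemble import RandomForestClassifier

def iter_batches(X, y, batch_size=1_000_000):
    # Hypothetical helper: yield the pixel matrix in row chunks
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

clf = RandomForestClassifier(warm_start=True)
n_trees = 0
for X_batch, y_batch in iter_batches(X, y):
    n_trees += 10                          # grow the forest by 10 trees per batch
    clf.set_params(n_estimators=n_trees)
    clf.fit(X_batch, y_batch)              # only the newly added trees see this batch

Note that each tree only ever sees a single batch, so this is not identical to fitting one forest on all of the data at once, but it keeps memory usage bounded by the batch size.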
Alternatively, you can train several independent forests (each on a subset of the data) and merge their trees into a single classifier:
from functools import reduce
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    # Graft the trees of rf_b onto rf_a
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# Create 'n' random forest classifiers
# (for the memory-constrained case, fit each one on a different chunk of the data)
rf_clf = [generate_rf(X_train, y_train, X_test, y_test) for i in range(n)]
# Combine them into a single forest
rf_clf_combined = reduce(combine_rfs, rf_clf)
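Either way, the result behaves like any fitted RandomForestClassifier, so prediction works as usual (assuming a held-out X_test as above), and predict_proba averages over all of the combined trees just as it would for a forest trained in one shot:

y_pred = rf_clf_combined.predict(X_test)
probabilities = rf_clf_combined.predict_proba(X_test)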