 

Sequentially fitting Random Forest sklearn

I am training a Random Forest Classifier in Python using sklearn on a corpus of image data. Because I am performing image segmentation I have to store the data of every pixel, which ends up being a huge matrix of roughly 100,000,000 data points, so when I run an RF classifier on that matrix my computer hits a memory overflow error and takes forever to run.

One idea I had was to train the classifier on sequential small batches of the dataset, eventually training on the whole thing while improving the fit of the classifier each time. Is this an idea that could work? Or will each call to fit simply override the previous fit?

asked Dec 13 '16 by yodama
1 Answer

You can use warm_start to keep the trees already fitted and add new ones on each subsequent call to fit:

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)

# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
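As a runnable sanity check (a sketch I'm adding; the make_classification data and the 1000/1000 batch split are my assumptions, not part of the original answer), the forest ends up containing the trees from both fits:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pixel data, split into two batches
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X1, X2 = X[:1000], X[1000:]
y1, y2 = y[:1000], y[1000:]

clf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
clf.fit(X1, y1)                  # fits trees 1..100 on the first batch
clf.set_params(n_estimators=200)
clf.fit(X2, y2)                  # fits only trees 101..200, on the second batch
print(len(clf.estimators_))      # 200
```

Note that the second fit call trains only the 100 new trees, and only on X2 — the earlier trees never see the second batch.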

Alternatively, you can train separate small forests and merge their trees into one classifier:

from functools import reduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# Create n random forest classifiers
rf_clf = [generate_rf(X_train, y_train, X_test, y_test) for i in range(n)]
# Combine the classifiers into one forest
rf_clf_combined = reduce(combine_rfs, rf_clf)
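To match the question's sequential-batch setting, the same helpers can be driven over disjoint chunks of the data instead of one shared training set (a sketch; the np.array_split chunking and synthetic data are my assumptions, not part of the original answer):

```python
from functools import reduce

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def generate_rf(X_batch, y_batch):
    # Train a small forest on one batch of the data
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_batch, y_batch)
    return rf

def combine_rfs(rf_a, rf_b):
    # Merge rf_b's trees into rf_a and keep n_estimators consistent
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X, y = make_classification(n_samples=1200, random_state=0)
# One small forest per batch of data, then merge all their trees
batches = zip(np.array_split(X, 4), np.array_split(y, 4))
forests = [generate_rf(Xb, yb) for Xb, yb in batches]
combined = reduce(combine_rfs, forests)
print(combined.n_estimators)  # 20 (4 batches x 5 trees)
```

The combined object predicts like any normal forest, averaging over all merged trees; each tree, however, has only ever seen its own batch.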
answered Oct 20 '22 by SerialDev