I am training a Random Forest classifier in Python with scikit-learn on a corpus of image data. Because I am performing image segmentation I have to store the data for every pixel, which ends up being a huge matrix of roughly 100,000,000 data points. Running an RF classifier on that matrix gives my computer a memory overflow error and takes forever to run.
One idea I had was to train the classifier on sequential small batches of the dataset, eventually covering the whole thing while improving the fit each time. Could this work, or will each call to fit simply override the previous fit?
You can use warm_start to grow the forest incrementally, adding new trees fitted on each batch while keeping the trees already built:
from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)
# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
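As a rough sketch of how that extends to many batches (the iter_batches helper below is just an assumed placeholder for however you chunk your pixel matrix), you can keep growing the forest in a loop:

from sklearn.ensemble import RandomForestClassifier

def iter_batches(X, y, batch_size=1_000_000):
    # Hypothetical helper: yield the pixel matrix in row chunks
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

clf = RandomForestClassifier(warm_start=True)
n_trees = 0
for X_batch, y_batch in iter_batches(X, y):
    n_trees += 10                          # grow the forest by 10 trees per batch
    clf.set_params(n_estimators=n_trees)
    clf.fit(X_batch, y_batch)              # only the newly added trees see this batch

Note that each tree only ever sees a single batch, so this is not identical to fitting one forest on all of the data at once, but it keeps memory usage bounded by the batch size.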
Alternatively, you can train several independent forests (each on a subset of the data) and merge their trees into a single classifier:
from functools import reduce
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    # Graft the trees of rf_b onto rf_a
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# Create 'n' random forest classifiers
# (for the memory-constrained case, fit each one on a different chunk of the data)
rf_clf = [generate_rf(X_train, y_train, X_test, y_test) for i in range(n)]
# Combine them into a single forest
rf_clf_combined = reduce(combine_rfs, rf_clf)
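Either way, the result behaves like any fitted RandomForestClassifier, so prediction works as usual (assuming a held-out X_test as above), and predict_proba averages over all of the combined trees just as it would for a forest trained in one shot:

y_pred = rf_clf_combined.predict(X_test)
probabilities = rf_clf_combined.predict_proba(X_test)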