MemoryError when fitting scikit-learn Decision Tree and Random Forest Classifiers

I have a pandas DataFrame with 86k rows, 5 features and 1 target column. I'm trying to train a DecisionTreeClassifier using 70% of the DataFrame as train data, and I get a MemoryError from the fit method. I've tried changing some of the parameters but I don't really know what's causing the error so I don't know how to handle it. I'm on Windows 10 with 8GB of RAM.

Code

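from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# data is the pandas DataFrame described above
train, test = train_test_split(data, test_size=0.3)
X_train = train.iloc[:, 1:-1]  # first column is not a feature
y_train = train.iloc[:, -1]    # last column is the target
X_test = test.iloc[:, 1:-1]
y_test = test.iloc[:, -1]

DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)  # MemoryError is raised here
dt_predictions = DT.predict(X_test)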

Error

File (...), line 97, in <module>
DT.fit(X_train, y_train)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 362, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn\tree\_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn\tree\_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn\tree\_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 671612928 bytes

The same error occurs when I try RandomForestClassifier, always on the line that calls fit. How can I solve this?

asked Jun 21 '18 at 18:06 by julia


People also ask

How much memory does a random forest use?

The memory usage of a random forest depends on the size of each tree and on the number of trees. The most straightforward way to reduce memory consumption is to reduce the number of trees. For example, 10 trees use roughly one tenth of the memory of 100 trees of similar size.
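To make the relationship concrete, here is a minimal sketch that counts the nodes stored in forests of different sizes. The data is synthetic and the helper `total_nodes` is a hypothetical name introduced for illustration; node count is used only as a rough proxy for memory, and `max_depth=5` is an arbitrary value chosen to show that capping tree size also helps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the asker's DataFrame).
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = rng.randint(0, 3, size=2000)


def total_nodes(forest):
    """Sum the node counts of every fitted tree in the forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Fewer trees: fewer nodes to store overall.
small = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
large = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Same number of trees, but each tree is limited to depth 5.
capped = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=0
).fit(X, y)

print("10 trees:          ", total_nodes(small))
print("100 trees:         ", total_nodes(large))
print("100 trees, depth 5:", total_nodes(capped))
```

Both `n_estimators` and `max_depth` are real constructor parameters; which values are appropriate depends on the data and on how much accuracy can be traded for memory.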

What is the RandomForestClassifier model in scikit-learn?

A random forest classifier. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.


1 Answer

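I've been running into the same issue. Make sure you're dealing with a classification problem and not a regression problem. A classifier treats every distinct target value as its own class, so a target with many distinct values can make the trees very large. If your target column is continuous, you might want to use RandomForestRegressor (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) instead of RandomForestClassifier.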
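One possible way to run that check before fitting is sketched below. It uses scikit-learn's `type_of_target` utility; the helper `pick_forest` and the sample arrays are hypothetical and only illustrate the idea. Note that an integer-valued target with many distinct values is still reported as multiclass, so looking at the number of unique values is worthwhile too.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.utils.multiclass import type_of_target


def pick_forest(y):
    """Return an unfitted regressor for continuous targets, else a classifier.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    kind = type_of_target(y)
    if kind.startswith("continuous"):
        return RandomForestRegressor()
    return RandomForestClassifier()


# Made-up example targets (assumptions, not the asker's data).
y_continuous = np.array([0.5, 1.7, 2.3, 4.1])
y_classes = np.array([0, 1, 1, 0])

for name, y in [("continuous", y_continuous), ("classes", y_classes)]:
    model = pick_forest(y)
    # The number of distinct values hints at how many classes a classifier
    # would have to track.
    print(name, type_of_target(y), np.unique(y).size, type(model).__name__)
```

With a pandas target column such as `y_train` from the question, `y_train.nunique()` gives the same distinct-value count.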

answered Nov 02 '22 at 13:11 by Teuszie