I have a pandas DataFrame with 86k rows, 5 features and 1 target column. I'm trying to train a DecisionTreeClassifier using 70% of the DataFrame as train data, and I get a MemoryError from the fit method. I've tried changing some of the parameters but I don't really know what's causing the error so I don't know how to handle it. I'm on Windows 10 with 8GB of RAM.
Code
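from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# data is the 86k-row DataFrame described above
train, test = train_test_split(data, test_size=0.3)
X_train = train.iloc[:, 1:-1]  # first column is not a feature
y_train = train.iloc[:, -1]    # last column is the target
X_test = test.iloc[:, 1:-1]
y_test = test.iloc[:, -1]
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
dt_predictions = DT.predict(X_test)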
Error
File (...), line 97, in <module>
DT.fit(X_train, y_train)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 362, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn\tree\_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn\tree\_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn\tree\_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 671612928 bytes
The same error happens when I try the RandomForestClassifier, always on the line that does the fitting. How can I solve this?
The memory usage of a random forest depends on the size of each tree and on the number of trees. The most straightforward way to reduce memory consumption is to reduce the number of trees: for example, 10 trees will use roughly one tenth of the memory of 100 trees of similar size.
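As a rough illustration of that point, the sketch below fits two forests on synthetic data and compares their total node counts as a proxy for memory use. The data, the `total_nodes` helper, and the `max_depth=8` cap are assumptions made for the example, not values from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: 2000 rows, 5 features, binary target.
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)


def total_nodes(forest):
    """Sum the node counts of all fitted trees (hypothetical helper)."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Fewer trees and a depth cap both bound how large the model can grow.
small = RandomForestClassifier(n_estimators=10, max_depth=8, random_state=0).fit(X, y)
large = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0).fit(X, y)

print("nodes with 10 trees: ", total_nodes(small))
print("nodes with 100 trees:", total_nodes(large))
```

Parameters such as `max_depth`, `min_samples_leaf`, or `max_leaf_nodes` limit the size of each individual tree, so they also apply to a single DecisionTreeClassifier.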
For reference, the scikit-learn documentation describes RandomForestClassifier as follows: a random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
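The "averaging" in that description can be checked directly. This is a minimal sketch on synthetic data (the data and the forest size are assumptions for the example) showing that the forest's predicted class probabilities match the mean of its individual trees' probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit.
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's probabilities should agree with the per-tree average.
print(np.allclose(averaged, forest.predict_proba(X)))
```

Because every tree is kept in memory after fitting, the model's footprint grows with each tree added.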
I've been running into the same issue. Be sure you're dealing with a classification problem and not a regression problem. A classifier treats every distinct target value as a separate class, so a continuous target can produce a huge number of classes and a very large tree. If your target column is continuous, you might want to use http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html instead of RandomForestClassifier.
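One way to check this before fitting is to look at how scikit-learn interprets the target and at how many distinct values it has. The sketch below uses `sklearn.utils.multiclass.type_of_target` on two made-up target columns; the `describe_target` helper and the sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Two made-up targets: one continuous, one with three discrete labels.
rng = np.random.RandomState(0)
y_continuous = pd.Series(rng.rand(1000))
y_discrete = pd.Series(rng.randint(0, 3, size=1000))


def describe_target(y):
    """Return scikit-learn's target type and the number of distinct values."""
    return type_of_target(y), y.nunique()


print(describe_target(y_continuous))
print(describe_target(y_discrete))
```

If the real target comes back as "continuous", or has close to as many distinct values as there are rows, a regressor (for example DecisionTreeRegressor or RandomForestRegressor) is probably the better fit.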