I am trying to run scikit-learn's random forest classification on 279,900 instances with 5 attributes and 1 class. I get a memory allocation error at the fit line; it is not able to train the classifier at all. Any suggestions on how to resolve this issue?
The data is:
x, y, day, week, Accuracy
x and y are coordinates, day is the day of the month (1-30), week is the day of the week (1-7), and Accuracy is an integer.
code:

import csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Read the class labels (column 8) from the CSV.
with open("time_data.csv", "rb") as infile:
    re1 = csv.reader(infile)
    result = []
    ##next(reader, None)
    ##for row in reader:
    for row in re1:
        result.append(row[8])

trainclass = result[:251900]
testclass = result[251901:279953]

# Read the five feature columns from the same file.
with open("time_data.csv", "rb") as infile:
    re = csv.reader(infile)
    coords = [(float(d[1]), float(d[2]), float(d[3]), float(d[4]), float(d[5])) for d in re if len(d) > 0]

train = coords[:251900]
test = coords[251901:279953]
print "Done splitting data into test and train data"

clf = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)
print "Done training"
score = clf.score(test, testclass)
print "Done Testing"
print score
Error:
line 366, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes
From the scikit-learn docs: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."
I would start by adjusting those parameters, as sketched below. You can also try a memory profiler, or run the job on Google Colaboratory if your machine has too little RAM.
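For example, a minimal sketch of what that could look like; the specific values here are illustrative guesses to tune against your own data, not known-good settings:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# scikit-learn's trees operate on float32 internally, so passing a dense
# float32 array avoids an extra conversion copy of the training data.
X_train = np.asarray(train, dtype=np.float32)
y_train = np.asarray(trainclass)

clf = RandomForestClassifier(
    n_estimators=100,      # fewer trees than the original 500
    max_depth=20,          # cap depth; the default (None) grows each tree fully
    max_features="log2",
    min_samples_split=10,  # coarser splits mean fewer nodes per tree
    min_samples_leaf=5,
)
clf.fit(X_train, y_train)

Fewer and shallower trees trade some accuracy for a much smaller model: with ~280,000 rows, a fully grown tree can hold a huge number of nodes, and building 500 of them at once is what exhausts memory in your traceback.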
Please try Google Colaboratory. You can connect to a local runtime or a hosted one. It worked for me with n_estimators=10000.