I have a set of 2000 trained random regression trees (from scikit-learn's RandomForestRegressor with n_estimators=1). Training the trees in parallel (50 cores) on a large dataset (~100000 * 700000 = 70 GB at 8 bit) using multiprocessing and shared memory works like a charm. Note that I am not using the RF's built-in multicore support, since I do feature selection beforehand.
The problem: when testing on a large matrix (~20000 * 700000) in parallel, I always run out of memory (I have access to a server with 500 GB of RAM).
My strategy is to have the test matrix in memory and share it among all processes. According to a statement by one of the developers, the memory requirement for testing is 2 * n_jobs * sizeof(X); in my case another factor of 4 comes in, since the 8-bit matrix entries are upcast to float32 internally by the RF.
By these numbers, I think testing needs:
14 GB to hold the test matrix in memory + 50 (n_jobs) * 20000 (n_samples) * 700 (n_features) * 4 (upcast to float32) * 2 bytes = 14 GB + 5.6 GB ≈ 20 GB of memory.
Yet it always blows up to several hundred GB. What am I missing here? (I am on the newest version of scikit-learn, so the old memory issues should be ironed out.)
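For reference, the same back-of-the-envelope arithmetic as a quick sanity check (figures taken directly from the estimate above, using the 2 * n_jobs * sizeof(X) rule quoted from the developer):

# quick sanity check of the memory estimate above (all sizes in bytes)
n_jobs     = 50
test_bytes = 20000 * 700000 * 1        # 8-bit test matrix, held once in shared memory: ~14 GB
per_job    = 20000 * 700 * 4 * 2       # float32 upcast of the 700-feature slice, times 2 per the rule
total      = test_bytes + n_jobs * per_job
print total / 1e9                      # ~19.6, i.e. roughly 20 GB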
An observation: running on one core only, memory usage during testing fluctuates between 30 and 100 GB (as measured by free).
My code:
#----------------
#helper functions
import itertools
import math
import numpy as np
from multiprocessing import Pool

import settings  # project module providing the missing-value encoding nan_enc


def initializeRFtest(*args):
    global df_test, pt_test  #initialize test data and test labels as globals in shared memory
    df_test, pt_test = args

def star_testTree(model_featidx):
    return predTree(*model_featidx)
#end of helper functions
#-------------------
def RFtest(models, df_test, pt_test, features_idx, no_trees):
    #test trees in parallel
    ncores = 50
    p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
    args = itertools.izip(models, features_idx)
    out_list = p.map(star_testTree, args)
    p.close()
    p.join()
    return out_list
def predTree(model, feat_idx):
    #get all indices of samples that meet feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]  #discard rows with missing values in the given features
    #predict
    pred = model.predict(df_test.iloc[rows, feat_idx])
    return pred
#main program
out = RFtest(models, df_test, pt_test, features_idx, no_trees)
Edit:
Another observation: when chunking the test data, the program runs smoothly with much reduced memory usage. This is what I ended up using to make the program run.
Code snippet for the updated predTree function:
def predTree(model, feat_idx):
    # get all indices of samples that meet feature subset requirement
    nan_rows = np.unique(np.where(test_df.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(test_df.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]  # discard rows with missing values in the given features
    # predict height per valid sample
    chunksize = 500
    n_chunks = np.int(math.ceil(np.float(rows.shape[0]) / chunksize))
    pred = []
    for i in range(n_chunks):
        if n_chunks == 1:
            pred_chunked = model.predict(test_df.iloc[rows[i*chunksize:], feat_idx])
            pred.append(pred_chunked)
            break
        if i == n_chunks - 1:
            pred_chunked = model.predict(test_df.iloc[rows[i*chunksize:], feat_idx])
        else:
            pred_chunked = model.predict(test_df.iloc[rows[i*chunksize:(i+1)*chunksize], feat_idx])
        print pred_chunked.shape
        pred.append(pred_chunked)
    pred = np.concatenate(pred)
    # populate matrix
    predicted = np.empty(test_df.shape[0])
    predicted.fill(np.nan)
    predicted[rows] = pred
    return predicted
The memory usage of a random forest depends on the size of each tree and on the number of trees. The most straightforward way to reduce memory consumption is to reduce the number of trees; for example, 10 trees use roughly 10 times less memory than 100 trees.
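As a minimal sketch of that point (for a plain RandomForestRegressor, not the 2000 single-tree models from the question), the number of trees is controlled directly by n_estimators:

from sklearn.ensemble import RandomForestRegressor

# a 10-tree forest stores roughly a tenth of the tree structures of a 100-tree forest
rf_small = RandomForestRegressor(n_estimators=10, n_jobs=1)
rf_big   = RandomForestRegressor(n_estimators=100, n_jobs=1)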
I am not sure whether the memory issue is related to the use of itertools.izip in args = itertools.izip(models, features_idx), which may trigger creation of copies of the iterator along with its arguments in every worker process. Have you tried just using zip?
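A minimal sketch of that change inside RFtest (assuming nothing else needs to change); in Python 2, zip builds a plain list of pairs up front instead of a lazy iterator:

    #args = itertools.izip(models, features_idx)   # lazy iterator
    args = zip(models, features_idx)               # eager list of (model, feat_idx) pairs
    out_list = p.map(star_testTree, args)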
Another hypothesis is inefficient garbage collection that is not triggered when you need it. I would check whether running gc.collect() just before model.predict in predTree helps.
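An untested sketch of where that call would go in predTree (same row selection as in the question):

import gc

def predTree(model, feat_idx):
    nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]
    gc.collect()  # force a collection right before the prediction
    pred = model.predict(df_test.iloc[rows, feat_idx])
    return pred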
There is also a third potential reason (probably the most credible one). Let me cite the Python FAQ on How does Python manage memory?:
In current releases of CPython, each new assignment to x inside the loop will release the previously allocated resource.
In your chunked function you do precisely that: you repeatedly rebind pred_chunked.
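A minimal illustration of that behaviour (not the code from the question): rebinding the same name drops the reference to the previous iteration's result, so CPython can free it immediately:

import numpy as np

for i in range(5):
    # the array from the previous iteration becomes unreachable the moment
    # chunk is rebound, so CPython releases it right away (reference counting)
    chunk = np.empty(10 ** 7)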