
Understanding scikit-learn Random Forest memory requirements for prediction

I have a set of 2000 trained random regression trees (from scikit-learn's RandomForestRegressor with n_estimators=1). Training the trees in parallel (50 cores) on a large dataset (~100000*700000 = 70 GB @ 8-bit) using multiprocessing and shared memory works like a charm. Note that I am not using RF's built-in multicore support, since I am doing feature selection beforehand.

The problem: when testing a large matrix (~20000*700000) in parallel I always run out of memory (I have access to a server with 500 GB of RAM).

My strategy is to have the test matrix in memory and share it among all processes. According to a statement by one of the developers, the memory requirement for testing is 2*n_jobs*sizeof(X), and in my case another factor of 4 applies, since the 8-bit matrix entries are upcast to float32 internally in RF.

By numbers, I think for testing I need:
14 GB to hold the test matrix in memory + 50(=n_jobs)*20000(=n_samples)*700(=n_features)*4(upcasting to float)*2 bytes = 14 GB + 5.6 GB = ~20 GB of memory.
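The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the question, not a measurement: the variable names are made up for illustration, decimal gigabytes (1e9 bytes) are assumed, and the 700 is taken as given in the question (treating it as the number of features each tree sees is an assumption).

```python
# Rough prediction-memory estimate using the figures quoted in the question.
GB = 1e9  # assumption: decimal gigabytes, matching the question's rounding

n_samples = 20000
n_total_features = 700000   # columns of the full 8-bit test matrix
n_tree_features = 700       # figure from the question (assumed: features per tree)
n_jobs = 50
upcast = 4                  # bytes per value after uint8 -> float32
copy_factor = 2             # from the quoted 2 * n_jobs * sizeof(X) rule

# Shared 8-bit test matrix: one byte per entry.
test_matrix_gb = n_samples * n_total_features * 1 / GB
# Per-worker float32 copies of the feature subset.
workers_gb = n_jobs * n_samples * n_tree_features * upcast * copy_factor / GB
total_gb = test_matrix_gb + workers_gb

print("test matrix: %.1f GB" % test_matrix_gb)
print("workers:     %.1f GB" % workers_gb)
print("total:       %.1f GB" % total_gb)
```

Under these assumptions the total comes to roughly 20 GB, far below the several hundred GB actually observed, which is what makes the behaviour surprising.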

Yet it always blows up to several hundreds of GB. What am I missing here? (I am on the newest version of scikit-learn, so the old memory issues should be ironed out)

An observation:
Running on one core only, memory usage for testing fluctuates between 30 and 100 GB (as measured by free).

My code:

import itertools
from multiprocessing import Pool

import numpy as np

#----------------
#helper functions
def initializeRFtest(*args):
    global df_test, pt_test #initialize test data and test labels as globals in shared memory
    df_test, pt_test = args


def star_testTree(model_featidx):
    return predTree(*model_featidx)

#end of helper functions
#-------------------

def RFtest(models, df_test, pt_test, features_idx, no_trees):
    #test trees in parallel
    ncores = 50
    p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
    args = itertools.izip(models, features_idx)
    out_list = p.map(star_testTree, args)
    p.close()
    p.join()
    return out_list

def predTree(model, feat_idx):
    #get all indices of samples that meet feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:,feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]    #discard rows with missing values in the given features

    #predict
    pred = model.predict(df_test.iloc[rows,feat_idx])
    return pred    #was "return predicted", a name not defined in this function

#main program
out = RFtest(models, df_test, pt_test, features_idx, no_trees)

Edit: another observation. When the test data is chunked, the program runs smoothly with much lower memory usage. This is what I used to make the program run.
Code snippet for the updated predTree function:

def predTree(model, feat_idx):
    # get all indices of samples that meet feature subset requirement
    # (df_test is the global set by initializeRFtest)
    nan_rows = np.unique(np.where(df_test.iloc[:,feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]    #discard rows with missing values in the given features

    # predict height per valid sample, chunksize rows at a time
    chunksize = 500
    n_chunks = (rows.shape[0] + chunksize - 1) // chunksize    #ceiling division

    pred = []
    for i in range(n_chunks):
        # slicing past the end of rows is safe, so the last chunk needs no special case
        chunk_rows = rows[i*chunksize:(i+1)*chunksize]
        pred_chunked = model.predict(df_test.iloc[chunk_rows, feat_idx])
        print(pred_chunked.shape)
        pred.append(pred_chunked)

    # populate matrix; rows with missing values stay NaN
    predicted = np.empty(df_test.shape[0])
    predicted.fill(np.nan)
    if pred:    #np.concatenate raises on an empty list
        predicted[rows] = np.concatenate(pred)
    return predicted
asked Jul 01 '16 08:07 by Dahlai


1 Answer

The memory issue might be related to the use of itertools.izip in args = itertools.izip(models, features_idx), which may trigger the creation of copies of the iterator, along with its arguments, across all worker processes. Have you tried just using zip?

Another hypothesis is inefficient garbage collection that is not triggered when you need it. I would check whether running gc.collect() just before model.predict in predTree helps.
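A minimal sketch of that check is below. DummyModel and predict_after_gc are hypothetical names used only so the example runs on its own; in the question's code the call would sit inside predTree with the real fitted tree. Note that gc.collect() only reclaims objects caught in reference cycles, so whether it helps here is an open question.

```python
import gc


class DummyModel(object):
    """Hypothetical stand-in for a fitted tree; only predict() is needed."""

    def predict(self, X):
        return [sum(row) for row in X]


def predict_after_gc(model, X):
    # Force a full collection so unreachable cyclic garbage is freed
    # before predict() allocates its own working copies.
    n_unreachable = gc.collect()
    return model.predict(X), n_unreachable


pred, freed = predict_after_gc(DummyModel(), [[1, 2], [3, 4]])
print(pred)
```

The second return value is the count of unreachable objects that gc.collect() found, which gives a quick hint as to whether cyclic garbage was piling up at all.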

There is also a third potential reason (probably the most credible). Let me cite the Python FAQ entry "How does Python manage memory?":

In current releases of CPython, each new assignment to x inside the loop will release the previously allocated resource.

In your chunked function you do precisely that: you repeatedly assign to pred_chunked, so each chunk's temporary data can be released before the next chunk is processed.
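The rebinding behaviour can be observed directly with weak references. This is a sketch that relies on CPython's reference counting (other interpreters may free objects later); the Chunk class is a hypothetical stand-in for the temporary data created for each chunk.

```python
import weakref


class Chunk(object):
    """Hypothetical stand-in for the temporary data of one chunk."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


refs = []
for i in range(3):
    # Rebinding `chunk` drops the last reference to the previous object.
    chunk = Chunk(1024)
    refs.append(weakref.ref(chunk))

# A dead weak reference returns None when called.
alive = [r() is not None for r in refs]
print(alive)
```

On CPython only the last chunk is still alive after the loop, so at most one chunk's worth of temporaries is held at a time. In the unchunked version the whole selection is materialised in a single call instead.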

answered Sep 19 '22 06:09 by sophros