
Quantile random forests from scikit-garden very slow at making predictions

I've started working with quantile random forests (QRFs) from the scikit-garden package. Previously I was creating regular random forests using RandomForestRegressor from sklearn.ensemble.

It appears that the QRF's speed is comparable to that of the regular RF on small datasets, but as the size of the data increases, the QRF becomes MUCH slower than the RF at making predictions.

Is this expected? If so, could someone please explain why making these predictions takes so long, or suggest how I could get quantile predictions in a more timely manner?

See below for a toy example, where I test the training and prediction times for a variety of dataset sizes.

import matplotlib as mpl
mpl.use('Agg')
from sklearn.ensemble import RandomForestRegressor
from skgarden import RandomForestQuantileRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import time
import matplotlib.pyplot as plt

log_ns = np.arange(0.5, 5, 0.5) # number of observations (log10)
ns = (10 ** (log_ns)).astype(int)
print(ns)
m = 14 # number of covariates
train_rf = []
train_qrf = []
pred_rf = []
pred_qrf = []

for n in ns:
    # create dataset
    print('n = {}'.format(n))
    print('m = {}'.format(m))
    rndms = np.random.normal(size=n)
    X = np.random.uniform(size=[n,m])
    betas = np.random.uniform(size=m)
    y = 3 + np.sum(betas[None, :] * X, axis=1) + rndms

    # split test/train
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # random forest
    rf = RandomForestRegressor(n_estimators=1000, random_state=0)
    st = time.time()
    rf.fit(X_train, y_train)
    en = time.time()
    print('Fit time RF = {} secs'.format(en - st))
    train_rf.append(en - st)

    # quantile random forest
    qrf = RandomForestQuantileRegressor(random_state=0, min_samples_split=10, n_estimators=1000)
    qrf.set_params(max_features=X.shape[1] // 3)
    st = time.time()
    qrf.fit(X_train, y_train)
    en = time.time()
    print('Fit time QRF = {} secs'.format(en - st))
    train_qrf.append(en - st)


    # predictions
    st = time.time()
    preds_rf = rf.predict(X_test)
    en = time.time()
    print('Prediction time RF = {}'.format(en - st))
    pred_rf.append(en - st)

    st = time.time()
    preds_qrf = qrf.predict(X_test, quantile=50)
    en = time.time()
    print('Prediction time QRF = {}'.format(en - st))
    pred_qrf.append(en - st)

fig, ax = plt.subplots()
ax.plot(np.log10(ns), train_rf, label='RF train', color='blue')
ax.plot(np.log10(ns), train_qrf, label='QRF train', color='red')
ax.plot(np.log10(ns), pred_rf, label='RF predict', color='blue', linestyle=':')
ax.plot(np.log10(ns), pred_qrf, label='QRF predict', color='red', linestyle=':')
ax.legend()
ax.set_xlabel('log(n)')
ax.set_ylabel('time (s)')
fig.savefig('time_comparison.png')

Here is the output: [plot: time comparison of RF and QRF training and prediction times]

asked Jul 23 '18 by Tim Williams

1 Answer

I am not a developer of this or any other quantile regression package, but I've looked at the source code for both scikit-garden and quantRegForest/ranger, and I have some idea of why the R versions are so much faster:

EDIT: In a related GitHub issue, lmssdd mentions that this method performs significantly worse than the 'standard procedure' from the paper. I haven't read the paper in detail, so take this answer with a grain of salt.

Explanation of the difference between the skgarden and quantRegForest methods

The basic idea of the skgarden predict function is to save all the y_train values corresponding to all the leaves. Then, when predicting a new sample, you gather the relevant leaves and their corresponding y_train values and compute the (weighted) quantile of that array. The R versions take a shortcut: they save only a single, randomly chosen y_train value per leaf node. This has two advantages. First, gathering the relevant y_train values becomes much simpler, since there is always exactly one value in every leaf node. Second, the quantile calculation becomes much simpler, since every leaf has exactly the same weight.
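
To make the contrast concrete, here is a minimal numeric sketch of the two approaches for a single test sample in a two-tree forest. The numbers and names are toy assumptions, not skgarden's actual internals, and the weighted quantile uses one simple convention:

import numpy as np

# Hypothetical: the stored y_train values in the leaf this one test sample
# reaches, for each of the two trees
leaf_values_tree_a = np.array([1.0, 2.0, 5.0])  # leaf holding 3 training samples
leaf_values_tree_b = np.array([3.0])            # leaf holding 1 training sample

# skgarden-style: pool every stored value, weight each by 1/leaf_size so each
# tree contributes equally, then take a weighted quantile (the expensive part)
values = np.concatenate([leaf_values_tree_a, leaf_values_tree_b])
weights = np.concatenate([np.full(3, 1 / 3), np.full(1, 1.0)])
order = np.argsort(values)
cum_w = np.cumsum(weights[order]) / weights.sum()
q50_weighted = values[order][np.searchsorted(cum_w, 0.5)]  # -> 3.0

# R-style shortcut: one pre-drawn random value per leaf, so prediction is a
# plain unweighted quantile over just n_estimators numbers
shortcut = np.array([np.random.choice(leaf_values_tree_a), leaf_values_tree_b[0]])
q50_shortcut = np.quantile(shortcut, 0.5)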

Since you use only a single (random) value per leaf instead of all of them, this is an approximation method. In my experience, if you have enough trees (at least 50-100 or so), this has very little effect on the result. However, I don't know enough about the math to say exactly how good the approximation is.
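
This is not an analysis from the paper, just a toy Monte Carlo sketch of why more trees help, under the simplifying assumption that the per-tree draws behave roughly like i.i.d. samples from the pooled leaf values:

import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(size=10_000)   # stand-in for the pooled leaf y_train values
true_q = np.quantile(pool, 0.9)

for n_trees in (10, 100, 1000):
    # one random draw per "tree", plain quantile, averaged over 200 trials
    est = np.mean([np.quantile(rng.choice(pool, size=n_trees), 0.9)
                   for _ in range(200)])
    print(n_trees, round(abs(est - true_q), 4))  # the gap shrinks as n_trees grows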

TL;DR: how to make skgarden predict faster

Below is an implementation of the simpler R method of quantile prediction for a RandomForestQuantileRegressor model. Note that the first half of the function is the (one-time) process of selecting a random y_train value per leaf. If the authors were to implement this method in skgarden, they would logically move that part into the fit method, leaving only the last six or so lines, which would make for a much faster predict method. Also, in my example I am using quantiles from 0 to 1, rather than from 0 to 100.

import numpy as np

def predict_approx(model, X_test, quantiles=[0.05, 0.5, 0.95]):
    """
    Function to predict quantiles much faster than the default skgarden method
    This is the same method that the ranger and quantRegForest packages in R use
    Output is (n_samples, n_quantiles) or (n_samples,) if a scalar is given as quantiles
    """
    # Begin one-time calculation of random_values. This only depends on model, so could be saved.
    n_leaves = np.max(model.y_train_leaves_) + 1  # leaves run from 0 to max(leaf_number)
    random_values = np.zeros((model.n_estimators, n_leaves))
    for tree in range(model.n_estimators):
        for leaf in range(n_leaves):
            train_samples = np.argwhere(model.y_train_leaves_[tree, :] == leaf).reshape(-1)
            if len(train_samples) == 0:
                random_values[tree, leaf] = np.nan
            else:
                train_values = model.y_train_[train_samples]
                random_values[tree, leaf] = np.random.choice(train_values)
    # Optionally, save random_values as a model attribute for reuse later

    # For each sample, get the random leaf values from all the leaves they land in
    X_leaves = model.apply(X_test)
    leaf_values = np.zeros((X_test.shape[0], model.n_estimators))
    for i in range(model.n_estimators):
        leaf_values[:, i] = random_values[i, X_leaves[:, i]]

    # For each sample, calculate the quantiles of the leaf_values
    return np.quantile(leaf_values, np.array(quantiles), axis=1).transpose()
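
Assuming the qrf and X_test names from the question's snippet, the slow call can then be swapped out like so (remember the 0-1 quantile scale):

# replaces the slow built-in call, qrf.predict(X_test, quantile=50)
preds_qrf = predict_approx(qrf, X_test, quantiles=[0.5])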

answered Nov 14 '22 by Fenno Vermeij