
Confidence Interval from RandomForestRegressor in scikit-learn

scikit-learn has a quantile-regression-based confidence interval implementation for GBM (example from the docs).

Is there a reason why it doesn't provide a similar quantile-based loss implementation for RandomForestRegressor?

sumit_uk1 asked May 23 '26 06:05


1 Answer

There is a scikit-learn-compatible Quantile Regression Forest implementation that can be used to generate confidence intervals here: https://github.com/zillow/quantile-forest

Setup should be as easy as:

pip install quantile-forest

Then, as an example, to generate CIs on a full dataset (using 5-fold cross-validation, so that every sample gets an out-of-fold interval):

import matplotlib.pyplot as plt
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import KFold

X, y = datasets.fetch_california_housing(return_X_y=True)

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)

kf = KFold(n_splits=5)

y_true = []
y_pred = []
y_pred_lower = []
y_pred_upper = []

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (
        X[train_index], X[test_index], y[train_index], y[test_index]
    )

    qrf.set_params(max_features=X_train.shape[1] // 3)
    qrf.fit(X_train, y_train)

    # Predict the 2.5%, 50% (median), and 97.5% quantiles: a 95% prediction interval plus the median.
    y_pred_i = qrf.predict(X_test, quantiles=[0.025, 0.5, 0.975])

    y_true = np.concatenate((y_true, y_test))
    y_pred = np.concatenate((y_pred, y_pred_i[:, 1]))
    y_pred_lower = np.concatenate((y_pred_lower, y_pred_i[:, 0]))
    y_pred_upper = np.concatenate((y_pred_upper, y_pred_i[:, 2]))

fig = plt.figure(figsize=(10, 4))

y_pred_interval = y_pred_upper - y_pred_lower
sort_idx = np.argsort(y_pred_interval)
y_true = y_true[sort_idx]
y_pred_lower = y_pred_lower[sort_idx]
y_pred_upper = y_pred_upper[sort_idx]

# Center data, with the mean of the prediction interval at 0.
mean = (y_pred_lower + y_pred_upper) / 2
y_true -= mean
y_pred_lower -= mean
y_pred_upper -= mean

plt.plot(y_true, marker=".", ms=5, c="r", lw=0)
plt.fill_between(
    np.arange(len(y_pred_upper)),
    y_pred_lower,
    y_pred_upper,
    alpha=0.2,
    color="gray",
)
plt.plot(np.arange(len(y)), y_pred_lower, marker="_", c="0.2", lw=0)
plt.plot(np.arange(len(y)), y_pred_upper, marker="_", c="0.2", lw=0)
plt.xlim([0, len(y)])
plt.xlabel("Ordered Samples")
plt.ylabel("Observed Values and Prediction Intervals (Centered)")

plt.show()

CIs for California housing dataset
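
A quick sanity check on intervals like these is their empirical coverage: the fraction of observed values that fall inside their own interval, which for a nominal 95% interval would ideally land close to 0.95. The sketch below is not part of quantile-forest; `empirical_coverage` and `mean_interval_width` are hypothetical helper names, written in plain Python so they accept lists or the NumPy arrays built above (`y_true`, `y_pred_lower`, `y_pred_upper`). Since the centering step shifts all three arrays by the same amount, the result should be the same before or after it.

```python
def empirical_coverage(y_true, lower, upper):
    """Return the fraction of observations inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Count observations whose value lies within its interval (bounds inclusive).
    hits = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return hits / len(y_true)


def mean_interval_width(lower, upper):
    """Return the average interval width (narrower is sharper)."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    if len(lower) == 0:
        raise ValueError("inputs must not be empty")
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Toy data (made up for illustration): the last value falls outside its interval.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_lower = [0.0, 0.0, 0.0, 5.0]
toy_upper = [2.0, 3.0, 4.0, 6.0]

print(empirical_coverage(toy_true, toy_lower, toy_upper))  # 0.75
print(mean_interval_width(toy_lower, toy_upper))  # 2.5
```

With the arrays from the script, `empirical_coverage(y_true, y_pred_lower, y_pred_upper)` would give the out-of-fold coverage. Looking at coverage together with the mean width is a common way to compare interval methods, since very wide intervals can reach high coverage trivially.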

Reid Johnson answered May 24 '26 19:05