Even on small applications (<50K rows, <50 columns), using the mean absolute error criterion with sklearn's RandomForestRegressor is nearly 10x slower than using mean squared error. To illustrate on a small dataset:
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

def fit_rf_criteria(criterion, X=X, y=y):
    reg = RandomForestRegressor(n_estimators=100,
                                criterion=criterion,
                                n_jobs=-1,
                                random_state=1)
    start = time.time()
    reg.fit(X, y)
    end = time.time()
    print(end - start)

fit_rf_criteria('mse')  # 0.13266682624816895
fit_rf_criteria('mae')  # 1.26043701171875
Why does using the 'mae' criterion take so long for training a RandomForestRegressor? I want to optimize MAE for larger applications, but find the speed of the RandomForestRegressor tuned to this criterion prohibitively slow.
Thank you @hellpanderr for sharing a reference to the project issue. To summarize: when the random forest regressor optimizes for MSE, it optimizes for the L2-norm and a mean-based impurity metric. But when the regressor uses the MAE criterion, it optimizes for the L1-norm, which amounts to calculating the median. Unfortunately, sklearn's implementation of the MAE criterion currently appears to take O(N^2) time.
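To see why the two criteria lead to different computations, here is a small illustrative numpy sketch (a brute-force search over constant predictions, not sklearn's actual splitter): the constant that minimizes squared error is the mean, while the constant that minimizes absolute error is the median. The mean can be maintained incrementally with running sums as a split point slides, but the median cannot, which is where the extra cost comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1001)

# Brute-force search over candidate constant predictions
candidates = np.linspace(y.min(), y.max(), 2001)
sse = [np.square(y - c).sum() for c in candidates]  # L2 loss
sae = [np.abs(y - c).sum() for c in candidates]     # L1 loss

best_l2 = candidates[np.argmin(sse)]
best_l1 = candidates[np.argmin(sae)]

print(best_l2, y.mean())      # L2 minimizer is (approximately) the mean
print(best_l1, np.median(y))  # L1 minimizer is (approximately) the median
```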