
Custom Criterion for DecisionTreeRegressor in sklearn

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).

Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?

asked Oct 15 '22 00:10 by gribaldi


1 Answer

I am afraid `fit` accepts only one set of weights, `sample_weight`, which holds one weight per sample rather than one per output: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit

More disappointingly, because only one set of weights is allowed, the tree algorithms in sklearn are all written around that single set of weights.
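To make the limitation concrete, here is a minimal sketch of a multi-output fit with `sample_weight`. The data is synthetic and the variable names are made up for illustration; the point is only that the weights have shape `(n_samples,)`, so they re-weight rows and cannot favour `y1` over `y2`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Two outputs, y1 and y2, stacked as columns.
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=100),
    X[:, 1] - X[:, 2],
])

# One weight per sample (row), shape (n_samples,).
# There is no axis here along which y1 and y2 could be weighted differently.
sample_weight = np.ones(X.shape[0])
sample_weight[:50] = 2.0  # the first 50 rows count twice as much

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, Y, sample_weight=sample_weight)

pred = tree.predict(X)
print(pred.shape)  # (100, 2): one column per output
```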

As for a custom criterion:

There is a similar issue in the scikit-learn tracker: https://github.com/scikit-learn/scikit-learn/issues/17436

A potential solution is to create a criterion class that mimics an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976

However, if you read the code in detail, you will find that every weight-related variable assumes a single set of weights, which is not specific to any output (task).

So to customize it, you may need to hack a lot of code, including:

  1. Hack the fit function to accept a 2-D array of weights https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142

  2. Bypass the input checks (otherwise you have to keep hacking them too).

  3. Modify the tree builder to pass the weights through https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111 This part is painful: there are many related variables, and you would have to change them from double to double*.

  4. Modify the Criterion class to accept a 2-D array of weights https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976

  5. In init, reset and update, keep attributes such as self.weighted_n_node_samples separately for each output (task).
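To show what such a criterion would have to compute, here is a rough NumPy sketch of an output-weighted MSE impurity and the resulting split improvement. It is not the Cython code from `_criterion.pyx`; the function names `weighted_mse_impurity` and `split_improvement` and the `output_weights` argument are hypothetical, and for simplicity it uses one weight per output rather than a full 2-D weight array.

```python
import numpy as np

def weighted_mse_impurity(y, output_weights):
    """Hypothetical node impurity: weighted average of per-output variances."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(output_weights, dtype=float)
    per_output_var = y.var(axis=0)  # MSE of each output around the node mean
    return float(np.dot(w, per_output_var) / w.sum())

def split_improvement(y, left_mask, output_weights):
    """Impurity decrease when the node is split into left/right children."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    y_left, y_right = y[left_mask], y[~left_mask]
    children = (
        len(y_left) * weighted_mse_impurity(y_left, output_weights)
        + len(y_right) * weighted_mse_impurity(y_right, output_weights)
    ) / len(y)
    return weighted_mse_impurity(y, output_weights) - children

# Toy node with two outputs: y1 takes values {0, 2}, y2 takes values {0, 10}.
y = np.array([[0, 0], [0, 10], [2, 0], [2, 10]])
split_on_y1 = np.array([True, True, False, False])   # separates y1 perfectly
split_on_y2 = np.array([True, False, True, False])   # separates y2 perfectly

# With equal weights the split that explains y2 wins (y2 has larger variance).
equal = (1.0, 1.0)
print(split_improvement(y, split_on_y1, equal), split_improvement(y, split_on_y2, equal))

# With a heavy weight on y1 the preferred split flips.
heavy_y1 = (100.0, 1.0)
print(split_improvement(y, split_on_y1, heavy_y1), split_improvement(y, split_on_y2, heavy_y1))
```

A real implementation would have to maintain these per-output sums incrementally inside the criterion's init, reset and update methods, which is where most of the work listed above comes from.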

To be honest, I think this is really difficult to implement. It may be worth raising an issue with the scikit-learn maintainers.

answered Oct 19 '22 01:10 by Zealseeker