
Binning of continuous variables in sklearn ensembles and trees

Can anyone tell me how ensembles (like Random Forest, Gradient Boosting, AdaBoost) and trees (like Decision Trees) in sklearn (Python) handle continuous variables? Is each individual value treated as-is when building the trees, or are the values automatically binned? If they are binned, what logic is followed? If they are not binned, I am sure I am missing something; there should be some intelligent binning available (built in?) that bins the variable values according to the class distribution (at least in the case of binary classification).

In depth: when I load my arff file (millions of rows and a few hundred features in a highly skewed data set) in Weka and scroll through the variable/target (binary) plots, I can see that many variables have strong bins (ranges where the target is positive). Are these bins, i.e. >= x and <= y, automatically picked up by the sklearn models mentioned above? See the attached picture (if you look closely there are very thin red lines across 6 bars in a variable/target plot).

I would be really grateful for any insight on this.

Regards

[attached image: Weka variable/target plot]

asked Aug 13 '14 by Run2

2 Answers

With the default settings (non-random splits), every time a decision or regression tree is grown by splitting a dataset, the part of the dataset under consideration is sorted by the values of each of the features under consideration in turn (in a random forest or ExtraTrees forest, features may be randomly selected each time). The mean of every adjacent pair of feature values f[i], f[i+1] is then considered as a candidate split, unless the pair is less than 1e-7 apart (an arbitrary constant currently hardwired in the code). The best split according to the Gini/entropy/other split criterion is used to partition the dataset into the points with f < (f[i] + f[i+1]) / 2 and those with a higher value of f.

In other words, no explicit binning is performed.
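Here is a rough Python sketch of the idea (the actual splitter in scikit-learn is written in Cython and interleaves this with impurity computation; `candidate_thresholds` is just an illustrative helper, not part of the library):

```python
import numpy as np

def candidate_thresholds(feature_values, min_gap=1e-7):
    """Midpoints between adjacent sorted feature values, skipping
    pairs that are (nearly) identical - these are the only split
    points a tree ever has to consider for that feature."""
    v = np.sort(np.asarray(feature_values, dtype=float))
    thresholds = []
    for lo, hi in zip(v[:-1], v[1:]):
        if hi - lo > min_gap:          # near-duplicate values yield no split
            thresholds.append((lo + hi) / 2.0)
    return thresholds

print(candidate_thresholds([3.1, 0.5, 0.5, 2.0, 7.4]))
# [1.25, 2.55, 5.25]  -- one candidate per gap between observed values, no bins
```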

(I'm not actually much of a decision tree expert, but I did work on the scikit-learn implementation, in particular I optimized the splitting code by writing a faster sorting algorithm for it.)

answered Sep 20 '22 by Fred Foo

I don't know exactly what scikit-learn does, but I suspect there is no binning and that it simply uses the continuous values as they are. In the simplest form of a decision tree, the rules you test are simply x_j >= x_ij for every variable j and every observed realization x_ij of that variable.

The documentation (see 1.8.7 Mathematical Formulation) suggests that they use this simple approach: just test every possible threshold (or perhaps some subset) for every variable.
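You can convince yourself of this by fitting a tree on made-up continuous data and inspecting the learned thresholds through the public `tree_` attribute; they fall between observed values rather than on pre-computed bin edges:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 1)              # one continuous feature in [0, 1)
y = (X[:, 0] > 0.37).astype(int)  # arbitrary cut-off defining the two classes

clf = DecisionTreeClassifier().fit(X, y)

# Internal nodes have feature >= 0; leaves are marked with -2.
thresholds = clf.tree_.threshold[clf.tree_.feature >= 0]
print(thresholds)  # raw thresholds near 0.37, midway between two observed x values
```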

answered Sep 20 '22 by Roger Fan