
Random forest classifier probability only has values 0, 0.1, 0.2... 1

I'm trying to use a random forest to classify my data, but the class probabilities it produces are always values like 0, 0.1, 0.2, ..., 1 (to within 5 digits). Is this a statistics problem or a software problem? I'm using RandomForestClassifier from scikit-learn's ensemble module with Python 2.7.3 on Mac OS X 10.7.5. My data looks something like this:

y   x1   x2   x3   x4...
0   23   4    0
1   102  2    0
1   12   17   1

The response variable, y, is binary. There are 15 features, all either real or integer valued, some of them binary. I have about 2000 training points and 500 test points. I set the number of trees to 500 and the number of features to try per tree to 8, and used the defaults for everything else. After training the model, I generate the probabilities with the predict_proba function and get results like 0.90000000000000002 or 0.10000000000000001.

I thought this problem might be caused by a particular variable, so I trained the model on just one variable at a time, for five different variables. The probabilities for each variable alone have normal values like 0.5532. When I use two variables together, a few values like 0.70000 start to appear. When I use even more variables, a larger fraction of the values look like 0.700000.

Is this a statistics or software problem? numpy.test() passed, but scipy.test() and sklearn.test() both failed. I've used scikit-learn packages in the past where the tests failed without this problem. I know I should fix the packages, but I've spent 20 hours installing from source, then from binary packages, then reading over 30 webpages about how other people installed it or what bugs they hit. When they say installation is easy, I don't see them testing the packages. Thanks.

user1910316 asked Feb 18 '23 17:02


1 Answer

The default number of trees built by sklearn's random forest is 10 (this was the default before scikit-learn 0.22, which raised it to 100). It seems possible that you're not actually changing that value: with exactly 10 trees in the forest, this is what the output would look like. The probability is the fraction of trees giving class 1, so the values will be 0, .1, .2, ..., 1.

Can you check the parameters assigned to see whether it's actually building 500 trees?

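>>> import sklearn.ensemble
>>> # Default construction: n_estimators was 10 in the version used here
>>> rf = sklearn.ensemble.RandomForestClassifier()
>>> rf.n_estimators
10
>>> # Pass n_estimators explicitly to build 500 trees
>>> rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
>>> rf.n_estimators
500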
cohoz answered Feb 20 '23 09:02